Abstract

Formulating a statistical inverse problem as one of inference in a Bayesian model has
great appeal, notably for what this brings in terms of coherence, the interpretability of
regularisation penalties, the integration of all uncertainties, and the principled way in
which the set-up can be elaborated to encompass broader features of the context, such
as measurement error, indirect observation, etc. The Bayesian formulation comes close
to the way that most scientists intuitively regard the inferential task, and in principle
allows the free use of subject knowledge in probabilistic model building. However,
in some problems where the solution is not unique, for example in ill-posed inverse
problems, it is important to understand the relationship between the chosen Bayesian
model and the resulting solution.

Taking emission tomography as a canonical example for study, we present results
about consistency of the posterior distribution of the reconstruction, and a general
method to study convergence of posterior distributions. To study efficiency of Bayesian
inference for ill-posed linear inverse problems with constraint, we prove a version of the
Bernstein-von Mises theorem for nonregular Bayesian models.

1 Introduction



Inverse problems are almost ubiquitous in applied science and technology, and because of the 
need for rigorous analysis to characterise such problems, derive numerical solutions and assess 
their performance - not to mention intrinsic mathematical interest - they have long been the
subject of intense mathematical study. In the corresponding 'direct problem', (macroscopic, 
global) observational data are predicted from the (microscopic, local) model parameters of 
the system. The inverse problem aims to draw conclusions about model parameters from 
data: it is the home ground of statistical inference in the context of stochastic modelling. 

This paper is a contribution to the theory of inverse problems from a statistical, indeed 
Bayesian, perspective. Motivated by important problems in tomographic reconstruction, 
taken as a canonical example, we consider the asymptotic performance of Bayesian procedures 
in the small-noise limit, for a new class of models that we call generalised linear inverse 
problems, and discuss further opportunities for theoretical analysis. 

In the remainder of this introductory section, we develop further background for our
approach, by setting out our perspective on linear/Gaussian inverse problems.

1.1 Ill-posed problems and regularisation 

Inverse problems encountered in nature are commonly ill-posed: their solutions fail to satisfy 
at least one of the three desiderata of existing, being unique, and being stable. Thus, in the 
case of linear inverse problems, the focus is not on a unique solution x of

$$ y = Ax \tag{1} $$

for given matrix A and data vector y, but rather on the corresponding space of solutions.

Even when the solution x to (1) exists and is unique for each possible y, lack of stability
means that the solution can be extremely sensitive to small errors, either in the observed
y or in numerical computations for solving the equations. This has obvious deleterious
consequences for the practical value of solutions. To circumvent this, the inverse problem is
typically regularised, that is, re-formulated to include additional criteria, such as smoothness
of the solution:

$$ \hat{x} = \arg\min_{x:\, y = Ax} \mathrm{pen}(x), $$

where pen(x) is a suitable scalar penalty functional.
If the data are observed with error,

$$ y = Ax + \text{error}, $$

then, allowing for the possibility of lack of existence or uniqueness, we might replace the
natural least-squares formulation

$$ \hat{x} = \arg\min_{x} \|y - Ax\|^2 $$

of the inverse problem by

$$ \hat{x} = \arg\min_{x} \|y - Ax\|^2 + \nu\, \mathrm{pen}(x), \tag{2} $$

where $\nu$ is a positive constant determining the trade-off between accuracy and smoothness.

Such solutions make sense, and are commonly used, whether we regard the error in the 
data used as deterministic or stochastic in nature. The least-squares set up is rather natural, 
but from a statistical perspective corresponds to a Gaussian likelihood, and, as we shall see 
below, this may be replaced by certain other distributions without material change to the 
subsequent analysis. 
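
As a hedged illustration of the instability described above, and of the stabilising effect of the penalty in (2), the following Python sketch uses a made-up nearly singular 2 x 2 matrix, a made-up perturbation of the data, and the penalty pen(x) = ||x||^2. None of these numbers or function names come from the paper; they are assumptions chosen only to make the effect visible.

```python
# Illustrative 2x2 example; the matrix, data and nu are assumptions, not from the paper.

def solve_2x2(m, rhs):
    """Solve m x = rhs for a 2x2 matrix m by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((rhs[0] * d - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det)

def regularised_solution(A, y, nu):
    """Minimise ||y - A x||^2 + nu * ||x||^2 via (A^T A + nu I) x = A^T y."""
    (a, b), (c, d) = A
    ata = ((a * a + c * c + nu, a * b + c * d),
           (a * b + c * d, b * b + d * d + nu))
    aty = (a * y[0] + c * y[1], b * y[0] + d * y[1])
    return solve_2x2(ata, aty)

def distance(u, v):
    """Euclidean distance between two points in the plane."""
    return ((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2) ** 0.5

A = ((1.0, 1.0), (1.0, 1.0001))   # nearly singular matrix
y_clean = (2.0, 2.0001)           # equals A applied to (1, 1)
y_noisy = (2.0, 2.0002)           # second entry perturbed by 1e-4

# How far does the solution move when the data move by 1e-4?
direct_shift = distance(solve_2x2(A, y_clean), solve_2x2(A, y_noisy))
ridge_shift = distance(regularised_solution(A, y_clean, 1e-2),
                       regularised_solution(A, y_noisy, 1e-2))
print(direct_shift, ridge_shift)
```

In this toy case the unregularised solution moves by more than one unit in response to a data change of 1e-4, while the penalised solution barely moves.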



1.2 Inverse problems from a Bayesian perspective 

Smoothness, or other 'regular' behaviour of the solution to an inverse problem, is a prior 
assumption on the unknown x, information about the model parameters known or assumed 
before the data are observed. To use such information is thus to accept that the required so- 
lution must combine data with prior information. In a statistical context the best-established 
principle for doing this is the Bayesian paradigm, in which all sources of variation, uncertainty 
and error are quantified using probability. 

From this perspective, the solution to (2) is immediately recognisable - it is the maximum
a posteriori (MAP) estimate of x, the mode of its posterior distribution in a Bayesian model 
in which the data y are modelled with a Gaussian distribution with expectation Ax, with 
constant-variance uncorrelated errors, and in which the prior distribution of x has negative 
log-density proportional to pen(x). 

However, the Bayesian perspective brings more than merely a different characterisation 
of a familiar numerical solution. Formulating a statistical inverse problem as one of inference 
in a Bayesian model has great appeal, notably for what this brings in terms of coherence, 
the interpretability of regularisation penalties, the integration of all uncertainties, and the 
principled way in which the set-up can be elaborated to encompass broader features of the 
context, such as measurement error, indirect observation, etc. The Bayesian formulation 
comes close to the way that most scientists intuitively regard the inferential task, and in 
principle allows the free use of subject knowledge in probabilistic model building. For an
interesting philosophical view on inverse problems, falsification, and the role of Bayesian
argument, see Tarantola (2006).



1.3 Convergence of the posterior distribution 

Mathematical analysis of inverse problems usually takes the form of asymptotic arguments 
concerning how well the true solution (the value of x assumed to generate the data) can 
be recovered in the presence of noise, as the size of that noise goes to zero. In a statistical
setting, the noise is a random variable, its size might be the variance, and we are concerned
with convergence of random variables or their distributions - in the case of a Bayesian 
analysis, the focus is on the posterior distribution of x. 

Convergence of the posterior distribution on a finite-dimensional parameter space, with
identifiable likelihood and with the true parameter being an interior point of the parameter
space, follows from Doob's martingale convergence theorem (see Doob (1949) and
van der Vaart (1998) for the case where the sample size grows to infinity). The rate of convergence
follows from the Bernstein-von Mises theorem (van der Vaart 1998), which in fact states a
stronger result: that the posterior distribution centred at the true parameter and rescaled by
$\sqrt{n}$ converges to the Gaussian distribution as the sample size n grows to infinity, provided
the likelihood is identifiable with finite Fisher information matrix, the prior is continuous
at the true point and the true value of the parameter is an interior point of the parameter
space. Moreover, the limit is independent of the choice of the prior distribution.

However, to the best of our knowledge, the rates of convergence of the posterior distri- 
bution on a finite-dimensional parameter space in the case of non-identifiable likelihood and 
with the true parameter lying on the boundary of the parameter space, studied here, have 
not been considered previously. The particular example of the lack of identifiability of the 
likelihood considered here is the ill-posedness of the linear inverse problem. As we shall see, 
in the case of non-identifiable likelihood, the choice of the prior distribution strongly influ- 
ences the limit of the posterior distribution as well as the rate of convergence on the subspace 
where the likelihood is not identified. Also, we will show that the rate of convergence may 
change if the limiting point lies on the boundary of the parameter space. We shall identify 
the assumptions on the posterior distribution necessary for convergence, which can be used
as guidance to narrow down the set of potential prior distributions.

There are different approaches to quantifying the convergence rates. One of them is to
consider the concentration rate of the almost sure convergence of the posterior distribution,
which is the smallest $\varepsilon_\sigma$ such that

$$ P\big(d(x, x^\star) > \varepsilon_\sigma \mid y\big) \to 0 \quad \text{almost surely}, $$

considered by Ghosal et al. (2000), Walker (2004), van der Vaart and van Zanten (2008) and
Rousseau (2010) in the context of nonparametric models.



Another approach, considered by Hofinger and Pikkarainen (2007) in the context of linear
inverse problems, is to metrise weak convergence of the posterior distribution as a random
variable $\mu_{\mathrm{post}}(\omega) = \mu(x \mid y(\omega))$ using the Ky Fan metric (Fan 1944): see Section 4.2. This
type of convergence is weaker than almost sure convergence, and the convergence rates in this
metric are slower than the parametric rate with the mean square error loss. In particular,
there is an extra logarithm in the rate which is unavoidable.

The setting for Hofinger and Pikkarainen (2007) is the Gaussian linear inverse problem in
the form (2), with a particular quadratic penalty (Gaussian prior). Their main result (Theorem 11)
provides an upper bound on the Ky Fan metric between the posterior distribution
and its (degenerate) limit, as an explicit function of the size of the noise, the parameters of 
the model and prior, and quantities relating the prior mean to the null space of the matrix 
A. This result is used to prove a limit theorem (Theorem 13) on the convergence of this 
Ky Fan metric to 0, in a small-noise, high-prior-precision limit, and to give the rate of this 
convergence (Theorem 15). 

We adopt the Hofinger and Pikkarainen (2007) paradigm in the present paper, which 
extends their results to a broader class of assumed probability distributions for the data, to 
linear constraints on the solution, and to more general prior distributions, and to the case of 
the solution of the exact linear inverse problem being on the boundary. 

We motivate our study by presenting in Section 2 a nonlinear inverse problem important 
in medical imaging, and in Section 3 a geometrical view of the results in the linear/Gaussian 
case. Section 4 establishes the class of models we study and in Section 5 we formulate our 
theorems on rates of convergence of the posterior distribution. In Section 6 we study local 
behaviour of the posterior distribution in a neighbourhood of the limit that is formulated as 
a version of the Bernstein-von Mises theorem. The proofs are deferred to the Appendix.



2 Motivation 

A general formulation for nonlinear inverse problems would replace Ax = y by A(x) = y, for
some suitably smooth transformation or operator A, together with an assumed probability
distribution for $Y(\omega)$. Mathematical analysis of such problems is typically far more difficult
and technical than for the linear case. However, a more modest generalisation is enough to
formulate and analyse a broad range of nonlinear statistical inverse problems of considerable
practical importance. The model class we consider is formally defined in Section 4.1; here
we consider an important motivating example.



2.1 Single photon emission computed tomography 

Single photon emission computed tomography (SPECT) is a medical imaging technique in 
which a radioactively-labelled substance, known to concentrate in the tissue to be imaged, 
is introduced into the subject. Emitted particles are detected in a device called a gamma 
camera, forming an array of counts. Tomographic reconstruction is the process of inferring 
the spatial pattern of concentration of the radioactive isotope in the tissue from these counts. 
The Poisson linear model 

$$ y_t \sim \mathrm{Poisson}\big((Ax)_t\big) \tag{3} $$

independently for different t, is close to reality for the SPECT problem (there are some dead-time
effects and other artifacts in recording). Here x represents the spatial distribution of the
isotope, typically discretised on a grid, $x = \{x_s\}$, and y the array of detected photons, also
discretised, $y = \{y_t\}$, by the recording process. The array $A = (a_{ts})$ quantifies the emission,
transmission, attenuation, decay and recording process; $a_{ts}$ is the mean number of photons
recorded at t per unit concentration at pixel/voxel s.



See Green (1990) for further detail about the model, and an approach based on EM estimation
for MAP reconstruction of x, in a Bayesian formulation in which spatial smoothness
of the solution is promoted by using a pairwise difference Markov random field prior. Later,
Weir (1997) proposed fully Bayesian reconstruction.

Since Poisson distributions form an exponential family, this model can be seen as a
generalised linear model (Nelder and Wedderburn 1972), with identity link function, and
since A is ill-posed we can call this a generalised linear inverse problem.

We formalise the notion of 'small-noise limit' for this Poisson model in a practically
relevant way, by supposing that the exposure time for photon detection is extended by a
factor T, and then consider the rate of detection of photons, letting $T \to \infty$. Thus the
data-generation model becomes

$$ y_t \mid x_{\mathrm{true}} \sim \mathrm{Poisson}\big(T (A x_{\mathrm{true}})_t\big)/T, $$

independently, for t = 1, 2, . . . , n. To preserve the appearance of results from Hofinger and Pikkarainen (2007)
as far as possible, we write $1/T = \sigma^2 \to 0$.
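
To make the small-noise scaling concrete, here is a minimal Python sketch that computes the mean and variance of a Poisson count with mean T * mu after division by T, by summing the probability mass function directly. The function name, the truncation rule and the value mu = 3 are illustrative assumptions, not part of the paper.

```python
import math

def scaled_poisson_moments(mu, T):
    """Mean and variance of Poisson(T * mu) / T, obtained by summing the pmf."""
    lam = T * mu
    # Truncate the sum far in the upper tail, where the pmf is negligible.
    k_max = int(lam + 20.0 * math.sqrt(lam) + 50)
    mean = 0.0
    second = 0.0
    for k in range(k_max + 1):
        # log pmf of Poisson(lam) at k, evaluated in log space for stability
        log_pmf = k * math.log(lam) - lam - math.lgamma(k + 1)
        p = math.exp(log_pmf)
        mean += (k / T) * p
        second += (k / T) ** 2 * p
    return mean, second - mean ** 2

for T in (1, 10, 100, 1000):
    mean, var = scaled_poisson_moments(3.0, T)
    print(T, mean, var)
```

The mean stays at mu while the variance is mu / T, so extending the exposure by a factor T plays the role of a noise variance of order 1/T.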

2.2 Prior distributions 

From the beginning of Bayesian image analysis (Geman and Geman 1984; Besag 1986), use
has been made of prior distributions for image scenes that express generic, qualitative beliefs
about smoothness, yet do not rule out abrupt changes for real discontinuities (for example,
at tissue type boundaries in the case of medical imaging).

In common with much of the literature, we will concentrate here on Markov random
field prior distributions. The 'true image' $x_{\mathrm{true}}$ in emission tomography corresponds to a
physical reality, the discretised spatial distribution of concentration of a radioactive isotope.
Of course, this is non-negative, so we impose a constraint, written $x_{\mathrm{true}} \in \mathcal{X} \subset \mathbb{R}^p$ in general.

The first prior model we consider is Gaussian, apart from possible truncation by the
constraint,

$$ p(x) \propto \exp\Big\{-\frac{1}{2\gamma^2}\,\|x - x_0\|_B^2\Big\}, \qquad x \in \mathcal{X}, $$

where $\|u\|_B^2 = u^T B u$ and $B$ is a non-negative definite matrix. An important special case is
where $x_0 = 0$ and $B$ satisfies $B_{ss'} = 1$ if s and s' are neighbouring pixels (written $s \sim s'$),
otherwise $B_{ss'} = 0$. Then we have $\|x - x_0\|_B^2 = \sum_{s \sim s'} (x_s - x_{s'})^2$, a pairwise-interaction model.
In this and other important cases B is singular.

A second prior model is a log cosh pairwise-interaction Markov random field (Green 1990):



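$$ p(x) \propto \exp\Big\{-\frac{\delta(1+\delta)}{2\gamma^2} \sum_{s \sim s'} \log\cosh\big((x_s - x_{s'})/\delta\big)\Big\}, \qquad x \in \mathcal{X}. $$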

Here the parameter $\delta$ is considered to be fixed.

This model has some attractive properties. While giving less penalty to large abrupt
changes in x compared to the Gaussian, it remains log-concave. It bridges the extremes
$\delta \to \infty$, the Gaussian model just mentioned, and $\delta \to 0$, the corresponding Laplace pairwise-interaction
model, sometimes called the 'median prior'.

These distributions are improper since they are invariant to perturbing x by an arbitrary 
additive constant, but lead to proper posterior distributions, save in exceptional pathological 
circumstances. 



3 Geometrical perspective 

In this paper, we study inference for x given observed y in the limit as a noise parameter
(in the SPECT example, 1/T) goes to 0. We generally assume an identity link function, so
that y becomes concentrated on $A x_{\mathrm{true}}$ as $\sigma^2 \to 0$.

Because of the ill-posed/ill-conditioned character of the problem, we cannot expect consistency
in inference about $x_{\mathrm{true}}$ based on the likelihood alone. Even as $\sigma^2 \to 0$, so that
y converges to 'exact data' $y_{\mathrm{exact}} = A x_{\mathrm{true}}$, we will not be able to distinguish between
the elements of

$$ \{x : Ax = A x_{\mathrm{true}}\}. $$

One of the roles of the prior in the Bayesian approach is to resolve this ambiguity (as well
as generally improve reconstruction through 'regularisation', even without $\sigma^2 \to 0$). We recall
the 'physical' constraint in the SPECT problem, that x is component-wise non-negative, that
is, $x \in \mathcal{X} \subset \mathbb{R}^p_+$, since it quantifies the isotope concentration.

Insight into the interplay between the possibly ill-posed likelihood and the possibly degenerate
prior, and the role of the constraint $x \in \mathcal{X}$, can be obtained from a geometrical view
of the problem.



3.1 Gaussian likelihood and prior 

Here we focus on the Gaussian prior $p(x) \propto \exp\big(-\tfrac{1}{2\gamma^2}\|x - x_0\|_B^2\big)$ and Gaussian likelihood
$y \mid x \sim N(Ax, \sigma^2 I)$. This is the setting of Hofinger and Pikkarainen (2007), except that we
will allow B to differ from the identity and even be singular.

In the limit as $\sigma^2 \to 0$, we are interested in solutions of $Ax = y_{\mathrm{exact}}$, where $y_{\mathrm{exact}} = A x_{\mathrm{true}}$,
under the influence of the prior $p(x) \propto \exp\big(-\tfrac{1}{2\gamma^2}\|x - x_0\|_B^2\big)$. To obtain convergence to a
degenerate limit, we will need $\gamma^2 \to 0$ as well (though, as shown by Hofinger and Pikkarainen (2007)
for the case $B = I$, at a slower rate than $\sigma^2$).

Thus the posterior is proportional to

$$ \exp\Big\{-\frac{1}{2\sigma^2}\|y - Ax\|^2 - \frac{1}{2\gamma^2}\|x - x_0\|_B^2\Big\} \quad \text{subject to } x \in \mathcal{X}. $$

Let us first ignore any constraint on x, or equivalently assume $\mathcal{X} = \mathbb{R}^p$. By standard
manipulations, we can write this posterior as

$$ x \mid y \sim N\big((A^T A + \nu B)^{-1}(A^T y + \nu B x_0),\; \sigma^2 (A^T A + \nu B)^{-1}\big), \tag{4} $$

assuming the inverse matrix exists. As $\sigma^2$ and $\gamma^2 \to 0$ in such a way that $\nu = \sigma^2/\gamma^2 \to 0$,
the posterior converges to the point

$$ x^\star = \arg\min_{x \in \mathcal{X}:\, Ax = y_{\mathrm{exact}}} \|x - x_0\|_B^2. \tag{5} $$
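
The convergence of the posterior mean in (4) to the limit point in (5) can be checked numerically in the smallest non-trivial case. The sketch below assumes p = 2, n = 1, A = [1, 1], B = I and no constraint; the values of y and x0 and the function names are illustrative assumptions, not taken from the paper.

```python
def posterior_mean(y, x0, nu):
    """Posterior mean (A^T A + nu I)^{-1} (A^T y + nu x0) for A = [1, 1], B = I."""
    # Here A^T A + nu I = [[1 + nu, 1], [1, 1 + nu]].
    a, b = 1.0 + nu, 1.0
    det = a * a - b * b
    r1, r2 = y + nu * x0[0], y + nu * x0[1]   # components of A^T y + nu x0
    return ((a * r1 - b * r2) / det, (a * r2 - b * r1) / det)

def limit_point(y, x0):
    """Point of the line {x : x1 + x2 = y} closest to x0 in Euclidean norm."""
    t = (y - (x0[0] + x0[1])) / 2.0
    return (x0[0] + t, x0[1] + t)

y_exact, x0 = 2.0, (0.0, 1.0)
for nu in (1.0, 1e-2, 1e-4, 1e-6):
    print(nu, posterior_mean(y_exact, x0, nu))
print("limit:", limit_point(y_exact, x0))
```

With these inputs the posterior mean is (1/(2 + nu), (3 + nu)/(2 + nu)), which tends to the projection (0.5, 1.5) of x0 onto the solution line as nu tends to 0.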

Suppose that A is a real n × p matrix, and B a real symmetric non-negative definite
p × p matrix, both possibly of deficient rank. A rank condition is needed to ensure that the
information from the likelihood and prior together define a proper posterior, and determine
$x^\star$ uniquely.

Proposition 1. Suppose that A is a real n × p matrix, and B a real symmetric non-negative
definite p × p matrix, both possibly of deficient rank. Suppose also that the p × 2p block matrix
$[B : A^T A]$ has full rank p (or equivalently, the rows are linearly independent). Then for all
$\nu > 0$, $A^T A + \nu B$ is nonsingular.

It follows that there exists a nonsingular real matrix P, not necessarily orthogonal, such
that $P^T B P$, $P^T A^T A P$, and $P^T (A^T A + \nu B) P$ (for all $\nu > 0$) are all diagonal.

Furthermore, the limit as $\nu \to 0$ of $\nu (A^T A + \nu B)^{-1}$ is a well-defined finite non-negative
definite matrix C, and $\nu (A^T A + \nu B)^{-1} - C = O(\nu)$.

The proof is in Appendix A.1.

This last result gives us a full description of the posterior variance matrix as $\sigma^2 \to 0$,
$\gamma^2 \to 0$ while $\nu = \sigma^2/\gamma^2 \to 0$: recalling that the posterior variance (in the Gaussian case)
is $\sigma^2 (A^T A + \nu B)^{-1}$, we see that in the limit, those components corresponding to $a_k = +\infty$
scale as $\gamma^2$ and the remaining ones as $\sigma^2$. (This is before transformation by P, which scales
and skews the result, but in a way independent of $\gamma^2$ and $\sigma^2$.) We see from the fact that
$P^T A^T A P = \mathrm{diag}(1/(1 + a_k \nu_0))$ that the number of $a_k$ not equal to $+\infty$ is just the rank of
$A^T A$.

In summary, the posterior distribution is Gaussian, with variance scaling differently in
different directions. If q is the rank of A, then asymptotically the variance has q eigenvalues
scaling like $\sigma^2$ and the remaining (p − q) like the (larger) $\gamma^2$. Geometrically, contours of equal
posterior density are concentric ellipsoids in $\mathbb{R}^p$.
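
The two variance scales can be seen directly in a toy case. The following Python sketch assumes p = 2, n = 1, A = [1, 1] and B = I (so q = 1), and computes the eigenvalues of the posterior variance matrix sigma^2 (A^T A + nu I)^{-1}; the function name and the numerical values of sigma^2 and gamma^2 are illustrative assumptions.

```python
import math

def posterior_cov_eigenvalues(sigma2, gamma2):
    """Eigenvalues of sigma^2 (A^T A + nu I)^{-1} for A = [1, 1], nu = sigma^2 / gamma^2."""
    nu = sigma2 / gamma2
    a, b = 1.0 + nu, 1.0                      # A^T A + nu I = [[a, b], [b, a]]
    det = a * a - b * b
    # Entries of the covariance matrix sigma^2 * inverse([[a, b], [b, a]])
    c11, c12, c22 = sigma2 * a / det, -sigma2 * b / det, sigma2 * a / det
    # Closed-form eigenvalues of the symmetric 2x2 matrix [[c11, c12], [c12, c22]]
    mid = 0.5 * (c11 + c22)
    rad = math.hypot(0.5 * (c11 - c22), c12)
    return mid - rad, mid + rad

small, large = posterior_cov_eigenvalues(sigma2=1e-4, gamma2=1e-2)
print(small, large)
```

In this case the smaller eigenvalue equals sigma^2 / (2 + nu), of order sigma^2, and belongs to the direction (1, 1) seen by A; the larger one equals gamma^2 and belongs to the null direction (1, -1).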

3.2 Constrained case, and KKT theory

When $\mathcal{X}$ is a proper subset of $\mathbb{R}^p$, the concentric ellipsoids are truncated by the constraints
$x \in \mathcal{X}$. In the case of interest in SPECT, where we have simply componentwise non-negativity
constraints, the ellipsoids are truncated into the non-negative orthant. As $\sigma^2$ and
$\gamma^2$ become small, there are clear qualitative differences in the impact of this truncation
according to whether the centre $(A^T A + \nu B)^{-1}(A^T y + \nu B x_0)$ of the ellipsoid lies in the
interior of the orthant, on its boundary, or outside it. Since y becomes close to $y_{\mathrm{exact}}$ as
$\sigma^2 \to 0$, in the limit this is the same as the question of where $x^\star$ lies.

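Equation (5) is a quadratic programming problem, and could be solved numerically by
standard software.

We can get a theoretical handle on the solution through Karush-Kuhn-Tucker theory
(Kuhn and Tucker 1951). In the non-negativity constrained case, $\mathcal{X} = \mathbb{R}^p_+$, to minimise
$\|x - x_0\|_B^2$ subject to $x \ge 0$ and $Ax = y_{\mathrm{exact}}$ it is necessary and sufficient to find
$(x^\star, \mu, \lambda) \in \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^n$ such that

$$
\begin{aligned}
B(x^\star - x_0) - \mu + A^T \lambda &= 0, \\
x^\star &\ge 0, \\
A x^\star &= y_{\mathrm{exact}}, \\
\mu &\ge 0, \\
\text{for all } s,\quad \mu_s = 0 \ &\text{or}\ x^\star_s = 0.
\end{aligned}
$$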
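
As a small worked instance of these conditions, the Python sketch below treats the case p = 2, n = 1 with B = I and a single equality constraint with positive coefficients. It is only a sketch under these assumptions: it uses the objective 0.5 * ||x - x0||^2 (which has the same minimiser and gives the gradient B(x - x0) directly), finds the minimiser by comparing the projection onto the line with the two end points of the feasible segment, and then recovers the multipliers. The function name and the test values are hypothetical.

```python
def constrained_limit_2d(a, y, x0):
    """Minimise 0.5 * ||x - x0||^2 subject to a[0]*x1 + a[1]*x2 = y and x >= 0.

    Assumes a > 0 componentwise and y > 0. Returns (x, mu, lam) solving the KKT system.
    """
    a1, a2 = a
    t = (y - (a1 * x0[0] + a2 * x0[1])) / (a1 * a1 + a2 * a2)
    candidates = [
        (x0[0] + a1 * t, x0[1] + a2 * t),   # projection of x0 onto the line Ax = y
        (0.0, y / a2),                      # end point of the segment with x1 = 0
        (y / a1, 0.0),                      # end point of the segment with x2 = 0
    ]
    feasible = [c for c in candidates if min(c) >= 0.0]
    x = min(feasible, key=lambda c: (c[0] - x0[0]) ** 2 + (c[1] - x0[1]) ** 2)
    grad = (x[0] - x0[0], x[1] - x0[1])     # B (x - x0) with B = I
    s = 0 if x[0] > 0.0 else 1              # a coordinate with x_s > 0, where mu_s = 0
    lam = -grad[s] / a[s]
    # Stationarity B(x - x0) - mu + A^T lam = 0 gives mu = B(x - x0) + A^T lam.
    mu = (grad[0] + a1 * lam, grad[1] + a2 * lam)
    return x, mu, lam

x_star, mu, lam = constrained_limit_2d((1.0, 1.0), 1.0, (2.0, 0.0))
print(x_star, mu, lam)
```

For x0 = (2, 0) and the constraint x1 + x2 = 1, the unconstrained projection (1.5, -0.5) lies outside the orthant, so the solution sits on the boundary at (1, 0) with mu = (0, 1) and lambda = 1; the positive multiplier mu_2 marks the active constraint x2 = 0.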




Figure 1: Illustrating the geometry in the case p = 2, n = 1, with B = I. Contours of
posterior when $\gamma^2 > \sigma^2 > 0$.

The feasible set $\mathcal{X}^\star = \{x \in \mathcal{X} : Ax = y_{\mathrm{exact}}\}$ is a closed convex set, and $x^\star$ may be an
interior point, or satisfy one or more of the constraints $x_s \ge 0$ with equality. In the case where all entries
of A are non-negative (in accordance with physical reality), and for each s there is at least
one t with $a_{ts} > 0$ (and if not, then $x_s$ is unidentifiable, so might as well be omitted from
the model), $\mathcal{X}^\star$ is a bounded polyhedron (or polytope). Otherwise, $\mathcal{X}^\star$ may be unbounded.

If $\gamma^2$ remains bounded away from 0 as $\sigma^2 \to 0$, then, in the limit, the posterior has
support $\mathcal{X}^\star$.

3.3 Geometry in a more general case 

The form of (5) strongly suggests that analogous properties for the limit of the posterior
should hold in a much broader class of models. Provided that $\sigma^2$ and $\gamma^2 \to 0$ in such
a way that $\nu = \sigma^2/\gamma^2 \to 0$, we would expect similar limiting behaviour so long as (a) the
likelihood is maximised on $\{x : Ax = y_{\mathrm{exact}}\}$ as $\sigma \to 0$, and (b) the prior becomes close to
Gaussian as $\gamma \to 0$; subject to these, the precise form of the likelihood and prior should be
irrelevant. These observations motivate the model formulation of the next section.

In a general setting, more delicate, analytic, arguments will be needed to quantify the
convergence precisely, and these are given in the following sections. However, the broad
qualitative features of the solution for the Gaussian-Gaussian case (Section 3.1) continue to
hold: the posterior becomes increasingly concentrated near the hyperplane $\{x : Ax = y_{\mathrm{exact}}\}$,
with $\sigma^2$ dominating its squared variation about this hyperplane, while the variance parallel
to the hyperplane is of order $\gamma^2$. The effect of the truncation onto $x \in \mathcal{X}$ depends on whether,
in the absence of the constraint, the maximum of the posterior would lie in the interior of
$\mathcal{X}$, on its boundary, or outside it.

4 Model formulation and preliminaries 

4.1 General Bayesian model 

We assume that the joint density of the observable responses Y taking values in $\mathcal{Y} \subset \mathbb{R}^n$
(with respect to Lebesgue or counting measure) takes the form

$$ p(y \mid x) = F(y, Ax, \tau) = C_{y,\tau} \exp\{f_y(Ax)/\tau\}, \tag{6} $$

that is, that the distribution depends on $x \in \mathcal{X}$ only via $Ax$, where $\tau$ is a scalar dispersion
parameter; in the Gaussian model, $\tau$ is the variance $\sigma^2$. The observed data are generated
from this distribution, with $x = x_{\mathrm{true}}$, and we aim to recover $x_{\mathrm{true}}$ as $\tau \to 0$.

We assume a continuous bijective link function $G : \mathcal{Y} \to \mathbb{R}^n$ and write $G(y_{\mathrm{exact}}) = A x_{\mathrm{true}}$.
(In generalised linear models - see Example 3 below - commonly G has identical component
functions.)

We adopt a Bayesian paradigm, using a prior distribution with density given by

$$ p(x) \propto \exp(g(x)/\gamma^2), \qquad x \in \mathcal{X} \subset \mathbb{R}^p, \tag{7} $$

where $\gamma^2$ is a scalar dispersion parameter for the prior that may depend on $\tau$; we relate this
to the data dispersion parameter $\tau$ by $\gamma^2 = \tau/\nu$, and express most of our results below in
terms of $\tau$ and $\nu$. Thus the posterior distribution satisfies

$$ p(x \mid y) \propto \exp\{[f_y(Ax) + \nu g(x)]/\tau\}, \qquad x \in \mathcal{X}. \tag{8} $$

Denote $\tilde f_y(x) = f_y(Ax)$ and $h_y(x) = -\tilde f_y(x) - \nu g(x)$, so that $p(x \mid y) \propto e^{-h_y(x)/\tau}$.
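We make the following assumptions about the error distribution:

1. If $Y \sim F(y, G(y_{\mathrm{exact}}), \tau)$, then $Y \xrightarrow{P} y_{\mathrm{exact}}$ as $\tau \to 0$.

2. For all $\mu_0 \in G^{-1}(A\mathcal{X})$, $f_{\mu_0}(\eta)$ has a unique maximum over $A\mathcal{X}$ at $\eta = G(\mu_0)$,
$\nabla_\eta f_{\mu_0}(G(\mu_0)) = 0$ and $\nabla^2_\eta f_{\mu_0}(G(\mu_0))$ is of full rank.

(Throughout, we use $\nabla_i = \partial/\partial x_i$ as the differentiating operator, and $\nabla = (\nabla_1, \ldots, \nabla_p)^T$ as
the gradient. Similarly, $\nabla_{ij}$ and $\nabla_{ijk}$ are operators of the second and third derivatives, with
$\nabla^2 = (\nabla_{ij})$ being the matrix of second derivatives.)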

Assumption 2 implies that the likelihood is regular in Ax. Various conditions are sufficient
for a distribution to satisfy this assumption, for example, the following:

2'. $G(EY) = Ax$ and $\exists \alpha > 0$ and $v(\tau)$: $\forall i$, $E(|Y_i - EY_i|^\alpha) \le v(\tau)$ such that $v(\tau) \to 0$ as
$\tau \to 0$.

Then, a variant of Chebyshev's inequality implies convergence.

Example 1. Assumption 2 is satisfied even for a location-scale Cauchy distribution $t_1(\mu, \sigma)$
with, say, $\alpha = 1/2$:

$$ E|Y - \mu|^{1/2} = \int_{-\infty}^{\infty} \frac{\sigma\,|x|^{1/2}}{\pi(\sigma^2 + x^2)}\,dx, $$

which is finite and goes to 0 as $\sigma \to 0$. However, assumption 1 is not satisfied for the Cauchy
distribution (or indeed any rescaled/recentred distribution with polynomial decay) since the
density cannot be cast in the form (6) for any choice of $\tau$.
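
The claim that this fractional moment is finite and vanishes with the scale can be checked numerically. The sketch below is an illustration only: it evaluates the half-order absolute moment of a centred Cauchy variable with scale sigma by a midpoint rule after two changes of variable, and compares it with the closed form sqrt(2 * sigma), which follows from the standard integral of u^(1/2) / (1 + u^2) over (0, infinity) being pi / sqrt(2). The function name and grid size are assumptions.

```python
import math

def cauchy_abs_sqrt_moment(sigma, n=20000):
    """Numerically evaluate E|Y - mu|^(1/2) for a Cauchy variable with scale sigma.

    With x = sigma * tan(theta) and then theta = pi/2 - s^2, the expectation becomes
    (4 / pi) * sqrt(sigma) * integral over s in (0, sqrt(pi/2)) of
    s * sqrt(cos(s^2) / sin(s^2)), whose integrand is bounded.
    """
    upper = math.sqrt(math.pi / 2.0)
    h = upper / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h                    # midpoint of the i-th subinterval
        s2 = s * s
        total += s * math.sqrt(max(math.cos(s2), 0.0) / math.sin(s2))
    return (4.0 / math.pi) * math.sqrt(sigma) * total * h

for sigma in (1.0, 0.1, 0.01):
    print(sigma, cauchy_abs_sqrt_moment(sigma), math.sqrt(2.0 * sigma))
```

The numerical values track sqrt(2 * sigma), so the moment bound v in condition 2' can be taken proportional to the square root of the scale in this example.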

Example 2. Both assumptions are satisfied for the power exponential (Subbotin) distributions
$F(y, \mu, \sigma) = C_{\sigma,p} \exp\{-[(y - \mu)^2]^{p/2}/\sigma^p\}$ ($p > 0$), with $\tau = \sigma^p$ and $f_y(\mu) = -[(y - \mu)^2]^{p/2}$.



Example 3. In the generalised linear models of Nelder and Wedderburn (1972), an important
class of nonlinear statistical regression problems, responses $y_t$, t = 1, 2, ..., n, are drawn
independently from a one-parameter exponential family of distributions in canonical form,
with density or probability function

$$ p(y_t; \mu_t, \tau) = \exp\Big\{\frac{y_t\, b(\mu_t) - c(\mu_t)}{\tau} + d(y_t, \tau)\Big\}, $$

using the mean parameterisation, for appropriate functions b, c and d characterising the
particular distribution family. The parameter $\tau$ is a common dispersion parameter shared by
all responses. The expectation of this distribution is $E(y_t; \mu_t, \tau) = \mu_t = c'(\mu_t)/b'(\mu_t)$. Both
assumptions are satisfied for this example.

As the link function G is continuous and monotonic, we could consider a linear inverse
problem $Ax = \tilde y_{\mathrm{exact}}$ where $\tilde y_{\mathrm{exact}} = G(y_{\mathrm{exact}})$, $\tilde Y = G(Y)$ and $\tilde y = G(y)$. The expressions
with respect to x do not change; however, the Ky Fan distance $\rho_K(Y, y_{\mathrm{exact}})$ is replaced with
$\rho_K(\tilde Y, \tilde y_{\mathrm{exact}})$. Hence, to simplify the notation, we assume below that the link function is the
identity.

We will assume that $\mathcal{X} = [0, \infty)^p$. We could assume that the parameter x is restricted
to an arbitrary convex polyhedron; this could be reduced to $[0, \infty)^p$ by a linear change of
variables. In fact, the results below apply to $\mathcal{X}$ such that for any $x \in \mathcal{X}$,
$[B(x, \delta) \cap \mathcal{X} - x]/\tau \to \mathbb{R}^k \times [0, \infty)^m \times (-\infty, 0]^{p-k-m}$ as $\tau, \delta \to 0$ and $\delta/\tau \to \infty$.

4.2 Metrics for quantifying convergence 

Definition 1. The Ky Fan metric between two random variables $\xi_1$ and $\xi_2$ in a metric space
$(\mathcal{Y}, d_{\mathcal{Y}})$ is defined by

$$ \rho_K(\xi_1, \xi_2) = \inf\{\varepsilon > 0 : P\big(d_{\mathcal{Y}}(\xi_1(\omega), \xi_2(\omega)) > \varepsilon\big) < \varepsilon\}. $$

Convergence in this metric is equivalent to convergence in probability (Dudley 2003).
Hence, weak convergence of the posterior distribution $\mu_{\mathrm{post}}$ (as a random variable) to $\delta_{x^\star}$,
the point mass at $x^\star$, is equivalent to its convergence in the Ky Fan metric, where the metric
space $(\mathcal{Y}, d_{\mathcal{Y}})$ is a space of probability distributions equipped with the Prokhorov metric.
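
To give a feel for Definition 1, the following Python sketch computes an empirical version of the Ky Fan metric from a list of observed distances between paired draws of the two random variables, replacing the probability by a sample fraction. This plug-in estimate, the bisection approach and the function name are illustrative assumptions rather than anything used in the paper.

```python
def empirical_ky_fan(distances, iterations=60):
    """Empirical Ky Fan metric: inf{eps > 0 : fraction(d > eps) < eps}.

    `distances` holds d(xi_1(omega_i), xi_2(omega_i)) for sampled omega_i.
    """
    n = len(distances)

    def holds(eps):
        # Defining condition, with the probability replaced by a sample fraction.
        return sum(1 for d in distances if d > eps) / n < eps

    # If the condition holds at eps it holds at every larger eps, so the set of
    # admissible eps is an up-set and its infimum can be located by bisection.
    lo, hi = 0.0, 2.0          # fails at 0 and holds at 2, since a fraction is at most 1
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if holds(mid):
            hi = mid
        else:
            lo = mid
    return hi

print(empirical_ky_fan([0.0] * 5 + [10.0] * 5))   # half the draws are far apart
print(empirical_ky_fan([0.001 * k for k in range(100)]))
```

The first call illustrates that the metric is capped by the probability of a large discrepancy: half of the distances are large, so the value is 0.5 however large those distances are.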

Definition 2. The Prokhorov metric between two measures on a metric space $(\mathcal{X}, d_{\mathcal{X}})$ is
defined by

$$ \rho_P(\mu_1, \mu_2) = \inf\{\varepsilon > 0 : \mu_1(B) \le \mu_2(B^\varepsilon) + \varepsilon \ \ \forall\ \text{Borel } B\}, $$

where $B^\varepsilon = \{x : \inf_{z \in B} d_{\mathcal{X}}(x, z) < \varepsilon\}$.

In particular, convergence in this metric is equivalent to convergence in distribution
(Dudley 2003), and so weak convergence of the posterior distribution can be studied as
convergence of the Ky Fan metric $\rho_K(\mu_{\mathrm{post}}, \delta_{x^\star})$.

4.3 Boundary and local geometry

Now we describe the local geometry of the posterior distribution around the point $x^\star$, where

$$ x^\star = \arg\max_{x \in \mathcal{X}:\, Ax = y_{\mathrm{exact}}} g(x). $$

We assume that the prior distribution is such that $x^\star$ is a unique solution. We relax the
assumption that $x^\star$ is a regular point, by allowing it to lie on the boundary of $\mathcal{X}$.

The definition above implies that if $x^\star$ is an interior point of $\{x \in \mathcal{X} : Ax = y_{\mathrm{exact}}\}$, then

$$ 0 = \big[\nabla_z\, g(x^\star + (I - P_{A^T}) z)\big]_{z=0} = (I - P_{A^T}) \nabla g(x^\star), \tag{9} $$

where $P_{A^T}$ is the projection on the range of $A^T$. However, if $x^\star$ is on the boundary, the
gradient $\nabla g(x^\star)$ may not be zero. In the case of a truncated distribution on $\mathcal{X}$ this corresponds
to the maximum lying outside of $\mathcal{X}$ (see Section 3 for further insight into the geometry of
the boundary). Denote the set of coordinates where this vector is non-zero by

$$ S = \{i : [\nabla g(x^\star)]_i \ne 0\}, $$

and the projection on the S coordinates by $P_S$, i.e. $(P_S)_{ii} = 1$ if $i \in S$ and $(P_S)_{ij} = 0$ for all
other i, j. We assume that

$$ \mathrm{rank}\big([A^T : P_S]\big) = \mathrm{rank}(A) + \mathrm{rank}(P_S) $$

(where $[A^T : P_S]$ is the block matrix putting $A^T$ and $P_S$ side by side); this simply prevents
degeneracy.

We consider the spaces $\mathcal{A} = \{v : Av = 0\}$ and $\mathcal{C} = \{v : P_S v = 0\}$ (the null spaces of A
and $P_S$), and their intersection. Let $r_0 = \mathrm{rank}(A)$, $r_2 = \mathrm{rank}(P_S)$ and set $r_1 = p - r_0 - r_2$ (so
that $r_0 + r_1 + r_2 = p$). By the rank-nullity theorem, and using the rank assumption above,
the dimensions of $\mathcal{A}$, $\mathcal{C}$ and $\mathcal{A} \cap \mathcal{C}$ are respectively $r_1 + r_2$, $r_0 + r_1$ and $r_1$.

Define three vector spaces by $V_0 = \mathcal{C} \cap (\mathcal{A} \cap \mathcal{C})^\perp$, $V_1 = \mathcal{A} \cap \mathcal{C}$, $V_2 = \mathcal{A} \cap (\mathcal{A} \cap \mathcal{C})^\perp$; here, the
orthogonal complements in the definitions of $V_0$ and $V_2$ are with respect to the same fixed
choice of inner product as the projection operators defined earlier. It follows that $V_i$ has
dimension $r_i$, i = 0, 1, 2, $\mathbb{R}^p = V_0 \oplus V_1 \oplus V_2$, and we can write any $v = x - x^\star$ in $\mathbb{R}^p$ uniquely
in the form $v = v_0 + v_1 + v_2$, where $v_i \in V_i$, i = 0, 1, 2.

In the regular, non-boundary case where $x^\star$ is in the interior of $\mathcal{X}$, S is empty, $P_S$
has rank 0, $r_0 = \mathrm{rank}(A)$, $r_1 = p - \mathrm{rank}(A)$ and $r_2 = 0$.

Now consider the following projections and compositions of projections:

$$ Q_0 = (I - P_S) P_{[A^T : P_S]}, \qquad Q_1 = I - P_{[A^T : P_S]}, \qquad Q_2 = (I - P_{A^T}) P_{[A^T : P_S]}. $$

For each i = 0, 1, 2, $Q_i$ is diagonalisable, and we can let $U_i$ denote a $p \times p_i$ matrix of
orthogonal eigenvectors corresponding to the non-zero eigenvalues, where $p_i = \mathrm{rank}(Q_i)$.
Note that $p_2 = r_2$ but $p_0 \le r_0$ (and hence $p_1 \ge r_1$); $p_0 = r_0$ only if the matrix $A^T V_{y_{\mathrm{exact}}}(x^\star) A$ is
of the same rank as the matrix $A^T A$.
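Now let U be the block matrix

$$ U = [U_0 : U_1 : U_2]. $$

Note that U is not orthogonal - the centre block is orthogonal to the first and last block, but
the first and last blocks are not orthogonal to each other. These three blocks form bases for
$V_i$, i = 0, 1, 2, and U itself is a basis for $\mathbb{R}^p$. Any $v = x - x^\star$ in $\mathbb{R}^p$ can be expressed uniquely
as $v = Uw$, and $U^{-1}$ and w can be partitioned conformably as

$$ U^{-1} = \begin{pmatrix} (U^{-1})_0 \\ (U^{-1})_1 \\ (U^{-1})_2 \end{pmatrix} \quad \text{and} \quad w = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \end{pmatrix}, $$

where $(U^{-1})_i$ is a $p_i \times p$ matrix, $w_i \in \mathbb{R}^{p_i}$ and $U_i w_i = v_i \in V_i$ for each i. We will also denote
$(U^{-1})_{01} = \begin{pmatrix} (U^{-1})_0 \\ (U^{-1})_1 \end{pmatrix}$, $U_{01} = [U_0 : U_1]$ and $p_{01} = p_0 + p_1$. In particular, $(U^{-1})_k U_k = I_{p_k}$ for
k = 0, 1, 2 and $(U^{-1})_{01} U_{01} = I_{p_0 + p_1}$.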

Introducing the matrices of second derivatives:

$$ V_y(x) = -\nabla^2 f_y(Ax), \qquad B(x) = -\nabla^2 g(x), \qquad H_y(x) = \nabla^2 h_y(x) = A^T V_y(x) A + \nu B(x), $$

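the following quantities will be used in approximating the posterior distribution:

$$ H_{01,01} = U_{01}^T H_y(x^\star) U_{01}, \tag{10} $$

$$ x_0 = H_{01,01}^{-1} U_{01}^T \nabla h_y(x^\star), \tag{11} $$

$$ b = |U_2^T \nabla g(x^\star)|, \tag{12} $$

$$ b_{\min} = \min_i b_i. \tag{13} $$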

4.4 Quadratic approximation of $\log p(x \mid y)$

We approximate $h_y(x) = -\tau \log p(x \mid y)$ by a quadratic function of $x$ on $B(x^*, \delta)$ for an
appropriate $\delta$ using Taylor expansion:

$$h_y(x) = h_y(x^*) + (x - x^*)^T \nabla h_y(x^*) + \tfrac{1}{2}(x - x^*)^T \nabla^2 h_y(x^*)(x - x^*) + \Delta_h(\delta).$$
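The size of the remainder in this expansion can be illustrated numerically. The sketch below uses a hypothetical smooth function `h` (not the paper's $h_y$) with known gradient and Hessian, and evaluates the gap between `h` and its second-order Taylor polynomial at distance $\delta$ from the expansion point; for a function with bounded third derivatives the gap should shrink roughly like $\delta^3$.

```python
import numpy as np

def h(x):
    # Hypothetical smooth objective standing in for h_y (an assumption, not from the paper).
    return np.sum(np.exp(x) - x) + 0.5 * np.sum(x ** 2)

def grad_h(x):
    # Gradient of the hypothetical h.
    return np.exp(x) - 1.0 + x

def hess_h(x):
    # Hessian of the hypothetical h (diagonal for this choice).
    return np.diag(np.exp(x) + 1.0)

def taylor_remainder(x, x_star):
    # h(x) minus its second-order Taylor polynomial around x_star.
    d = x - x_star
    quad = h(x_star) + d @ grad_h(x_star) + 0.5 * d @ hess_h(x_star) @ d
    return h(x) - quad

x_star = np.array([0.3, -0.2, 0.1])
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)  # unit vector

for delta in (0.1, 0.05, 0.025):
    rem = taylor_remainder(x_star + delta * direction, x_star)
    print(f"delta={delta:<6} remainder={rem:.3e}")
```

Halving `delta` should divide the printed remainder by a factor close to 8, which is the cubic behaviour that the third-derivative bounds in the assumptions below are designed to control.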

We will need the following assumptions. Introduce the following neighbourhood of $y_{exact}$ in $\mathcal{Y}$:

$$\mathcal{Y}_{loc} = \{y \in \mathcal{Y} : \|y - y_{exact}\| < \rho_K(Y, y_{exact})\}.$$

By the definition of $\rho_K(Y, y_{exact})$, $P(\mathcal{Y}_{loc}) > 1 - \rho_K(Y, y_{exact})$.
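Assume that

1. $\exists \tau_0 > 0$: $\forall \tau \le \tau_0$, $\int_{\mathcal{X}} e^{-h_y(x)/\tau} dx < \infty$ for all $y \in \mathcal{Y}$;

2. $f_y, g \in C^3(B(x^*, \delta))$ for all $y \in \mathcal{Y}$;

3. $\exists C_{f3}, C_{g3} < \infty$ such that for all $x \in B(x^*, \delta)$, for all $y \in \mathcal{Y}_{loc}$ and all $1 \le i, j, k \le p$,

$$|\nabla_{ijk} f_y(x)| \le C_{f3}, \qquad (14)$$

$$|\nabla_{ijk} g(x)| \le C_{g3}; \qquad (15)$$

4. (a) $\exists M_{f0}, M_{f1}, M_{f2} < \infty$ such that for all $1 \le j_1, \ldots, j_d \le p$ with $d = 0, 1, 2$, and
for all $y \in \mathcal{Y}_{loc}$,

$$|\nabla_{j_1 \ldots j_d} f_y(x^*) - \nabla_{j_1 \ldots j_d} f_{y_{exact}}(x^*)| \le M_{fd} \|y - y_{exact}\|; \qquad (16)$$

(b) $\exists M_{f3} < \infty$ such that for all $x \in B(x^*, \delta)$ and all $1 \le i, j, k \le p$ and for $y \in \mathcal{Y}_{loc}$,

$$|\nabla_{ijk} f_y(x) - \nabla_{ijk} f_{y_{exact}}(x)| \le M_{f3} \|y - y_{exact}\|. \qquad (17)$$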

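The last two assumptions are satisfied if $\nabla^d f_y(x)$ is differentiable in $y$ and this derivative
is bounded on $\mathcal{Y}_{loc}$, with

$$M_{fd} = \sup_{y \in \mathcal{Y}_{loc}} |\nabla_y \nabla^d f_y(x^*)| \quad \text{for } d = 0, 1, 2,$$

$$M_{f3} = \sup_{y \in \mathcal{Y}_{loc}} \sup_{x \in B(x^*, \delta)} |\nabla_y \nabla^3 f_y(x)|.$$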

We choose $\delta$ such that the approximation error $\Delta_h(\delta)$ of $h_y(x)$ goes to 0, and that the
integral of $e^{-h_y(x)/\tau}$ over $\mathcal{X} \setminus B(x^*, \delta)$ is negligible compared to the integral over $B(x^*, \delta)$.
Assume that $\delta > 0$ satisfies the following conditions as $\tau \to 0$:

S > \ \xo\\, <IC ||a:^o||, """^ oi^oi) ^ ^ with high probability, (18) 

T 

5^0, >0, > oo. 

T T 

4.5 Choice of $\delta$

Consider the integral of $e^{-h_y(x)/\tau}$ over $B(x^*, \delta)$.

Lemma 1. Under the assumptions on $f_y$, $g$ and $\delta$, and assuming that $HH_{01,01}^{-1}$ exists,

See Proposition 2 in the Appendix, for further details and the proof.

Now we need to choose $\delta$ satisfying the conditions of the lemma such that the integral over
the remaining space $\mathcal{X} \setminus B(x^*, \delta)$ is negligibly small compared to the integral over $B(x^*, \delta)$
for small $\tau$, i.e. that

r Q~[hyix)-hy{x*)]/T^^ 
J B(x* ,0) 

f p-[hy{x)-K{x*)]/'<- (jr 
^ JX\B{x\S)^ fl+0fl)l 

^(p+P2)/2^-P2e-J^^^^oi.oixo/(2r)(27r)m/2[det(Moi,oi)]-'/2 Ei 
= o(l) as r — 7- 

under the assumptions of Lemma 1. This condition (19) is satisfied, for instance, under the
following assumptions on $h_y$.

Assume that there exists a function $q > 0$ such that for $x \in \mathcal{X} \setminus B(x^*, \delta)$, $h_y(x) - h_y(x^*) > q(\|x - x^*\|)$,
and $q(r) > c r^{\alpha}$ for $r > \delta$ and some $c = c(\delta) > 0$, $\alpha \in (0, 3)$.

Then, it is sufficient to choose $\delta$ satisfying

POO 

JcS"/t 

which is satisfied if $c(\delta) \asymp \text{const}$, e.g. with $\delta = [-\tau \log \tau]^{1/(\alpha + a)}$ for some $a > 0$. In this
case, it is possible to choose $\delta$ satisfying conditions (18) for appropriate $\alpha$ and $\nu$ if $\alpha < 3$.
Therefore, we can use the following lemma.

Lemma 2. Assume that there exists a function $q : \mathbb{R}_+ \to \mathbb{R}_+$ such that for $\|x - x^*\| > \delta$,
$h_y(x) - h_y(x^*) > q(\|x - x^*\|)$, and $q(r) > c r^{\alpha}$ for some $c = c(\delta) > 0$ and $\alpha \in (0, 3)$.
Then, for $\delta$ satisfying (18) and (20),

$$\int_{\mathcal{X}} e^{-(h_y(x) - h_y(x^*))/\tau} dx = [1 + o(1)] \int_{B(x^*, \delta)} e^{-(h_y(x) - h_y(x^*))/\tau} dx$$

as $\tau \to 0$.

In particular, if $c(\delta) \asymp \text{const}$, the above condition is satisfied with $\delta = [-\tau \log \tau]^{1/(\alpha + a)}$
for any $a > 0$.

If we assume that g(r) > clogr, c > t(j9+ 1), then, to have /g^-^* ^-j Q-{K{^)-Ki^*))/'<' dx ^ 
Ix\B(x* 5) e"(^!'(^)"''!'(''*)^/^(ia;, 6 must satisfy 

(5'=/^-P(c/r - p) > Y[[bi]-\2ny^''/^[det{HHoi,oi)]-^^^ 
or, equivalently, 

^ exp < T h r log r- 



c ° 2c 

where $C = \prod_i [b_i]^{-1} (2\pi)^{p_{01}/2} [\det(HH_{01,01})]^{-1/2}$. If $c(\delta)$ is a constant independent of $\delta$, the
fourth condition of (18), that $\delta \to 0$, is not satisfied.
5 Rates of convergence of posterior distribution in Ky Fan metric

As we have seen in Section 3, the rate of contraction of the posterior distribution (in terms
of Ky Fan distance) varies between $P_{A^T} \mathcal{X}$ and $(I - P_{A^T}) \mathcal{X}$ and is determined by the second
order behaviour of the logarithm of the posterior density. We shall also show below that, if
$x^*$ is on the boundary, the contraction rate in $P_S (I - P_{A^T}) \mathcal{X}$ is different and is determined
by the first order asymptotics.

Denote by $\mu_{post}(\omega)$ the posterior distribution of $X$ given $y = Y(\omega)$. We consider the metric
space $(\mathcal{X}, \ell_2)$ equipped with the Euclidean metric $\|x - z\| = \sqrt{\sum_{i=1}^p (x_i - z_i)^2}$, $\mathcal{X} \subset \mathbb{R}^p$. Then,
the posterior measure $\mu_{post}(\omega)$ can be viewed as a measure on the metric space $(\mathcal{X}, \ell_2)$. The
corresponding metric space for the observations is $(\mathcal{Y}, \ell_2)$, $\mathcal{Y} \subset \mathbb{R}^n$, equipped with the metric
generated by the $\ell_2$ norm.

In the next section we evaluate the level of concentration of the posterior distribution
$\mu_{post}$ around $x^*$. We start with the concentration of the posterior distribution $\mu_{post}(\omega)$ for a
fixed $\omega$ (i.e. for a particular data set) in the Prokhorov metric, and then, using the lifting
theorem (Theorem 3), we use the bounds thus obtained to derive a bound on the Ky Fan
distance between the posterior distribution and the limit over all $\omega$. We consider separately
the cases where $x^*$ is an interior point of $\mathcal{X}$ and where it is on the boundary of $\mathcal{X}$. In the
results below, it is assumed that the dimension $p$ is fixed and is independent of $\tau$.

Throughout this section we use the error $\Delta_0(B(0, \delta))$ defined by (19), and the constants,
including $C_p$, that feature in the upper bound on the Ky Fan metric between the
Gaussian distribution and its mean (Lemma 3 in the Appendix).



5.1 Prokhorov distance, fixed $\omega$

Define $\lambda_{\min\,pos}(M)$ to be the minimum positive eigenvalue of a matrix $M$, and $\lambda_{\min,P}(M) =
\min_{\|\xi\| = 1,\, P\xi = \xi} \|M\xi\|$ to be the smallest eigenvalue of a matrix $M$ on the range of a projection
matrix $P$.
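These two eigenvalue functionals are straightforward to compute numerically. The sketch below is one possible implementation under stated assumptions: $M$ is symmetric, $P$ is an orthogonal projection matrix, and the helper names `lambda_min_pos` and `lambda_min_on_range` are hypothetical. The second function follows the definition above literally, as the minimum of $\|M\xi\|$ over unit vectors in the range of $P$.

```python
import numpy as np

def lambda_min_pos(M, tol=1e-10):
    # Smallest strictly positive eigenvalue of a symmetric matrix M.
    eigenvalues = np.linalg.eigvalsh(M)
    positive = eigenvalues[eigenvalues > tol]
    return positive.min()

def lambda_min_on_range(M, P):
    # min ||M xi|| over unit vectors xi with P xi = xi, for an orthogonal projection P.
    # An orthonormal basis Q of range(P) is given by the eigenvectors of P with eigenvalue 1;
    # the minimum is then the smallest singular value of M Q.
    w, V = np.linalg.eigh(P)
    Q = V[:, w > 0.5]
    return np.linalg.svd(M @ Q, compute_uv=False).min()

M = np.diag([3.0, 0.0, 2.0])
P = np.diag([1.0, 0.0, 1.0])  # projection onto the first and third coordinates
print(lambda_min_pos(M), lambda_min_on_range(M, P))
```

For this diagonal example both quantities coincide; they differ in general, for instance when the range of $P$ meets the null space of $M$.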



Theorem 1. Suppose we have a Bayesian model given in Section 4.1 and let the assumptions
stated in Section 4.4 hold.

Assume that $(I - P_{A^T}) \nabla g(x^*) = 0$ and $[A^T V_{Y(\omega)}(x^*) A : B(x^*)]$ is of full rank.

Then, $\exists \tau_0 > 0$ such that for $\forall \tau \in (0, \tau_0]$,



Pp(Aipost(w),^z*) ^ max 



2Ao M/i||F(u;) -yexacti I +1^11^4^ V(?(x* 



1 ~l~ ^0 Ajniii pos 

(A^VVm(x'^)A) 



Amm('^) V \Amin('^) 



Cplog — ^ (1 + A,(5,0,r(u;))) , (21) 
where $\lambda_{\min}(\omega) = \lambda_{\min}(H_{Y(\omega)}(x^*))$ and $\Delta_*$ is defined by (37).

The first term in the sum represents the bias of the posterior distribution, and the second
term is the Prokhorov distance between $\mathcal{N}(0, \tau H_{Y(\omega)}(x^*)^{-1})$ and the point mass at zero.
The maximum reflects the fact that there are two "competing" tails: Gaussian on the ball
$B(x^*, \delta)$ and the tail of the posterior distribution outside the ball.

This theorem implies that to have convergence of the posterior distribution to $\delta_{x^*}$, we
must have (a) convergence of the data so that $\|Y - y_{exact}\| \to 0$, (b) $\nu = \tau/\gamma^2 \to 0$, i.e.
the prior distribution needs to be rescaled in a way dependent on the scale of the likelihood,
and (c) $\tau/\lambda_{\min}(H_{Y(\omega)}(x^*)) \to 0$. If the matrix $A^T V_{Y(\omega)}(x^*) A$ is of full rank, then, for small
$\tau$, $\lambda_{\min}(H_{Y(\omega)}(x^*))$ is close to the constant $\lambda_{\min}(A^T V_{y_{exact}}(x^*) A)$ with high probability, hence
the latter condition is satisfied as $\tau \to 0$. However, if $A^T V_{Y(\omega)}(x^*) A$ is not of full rank,
then, for small enough $\nu$ and $\tau$, $\lambda_{\min}(H_{Y(\omega)}(x^*)) = \nu \lambda_{\min, I - P_{A^T}}(B(x^*))$; hence, we must have
$\tau/\nu = \gamma^2 \to 0$.

This is summarised in the following corollary.

Corollary 1. For weak convergence of the posterior distribution to the point mass at $x^*$ as
$\tau \to 0$ for a fixed $\omega$, we must have $\nu = \tau/\gamma^2 \to 0$.

1. If the matrix $A^T V_{Y(\omega)}(x^*) A$ is not of full rank, then we must also have $\gamma \to 0$.

2. If the matrix $A^T V_{Y(\omega)}(x^*) A$ is of full rank, however, the scale of the prior distribution
$\gamma$ may be taken to be a positive constant.

Now we consider the case where $x^*$ is a boundary point of $\mathcal{X}$ and $(I - P_{A^T}) \nabla g(x^*) \ne 0$.

Theorem 2. Suppose we assume the Bayesian model defined in Section 4.1, and let the
assumptions on $f_y$, $g$ and $\delta$ stated in Section 4.4 hold.

Assume that $U_{01}^T [A^T V_{Y(\omega)}(x^*) A : B(x^*)] U_{01}$ is of full rank.

Then, $\exists \tau_0 > 0$ such that for $\forall \tau \in (0, \tau_0]$ and small enough $\tau/\nu$ and for any $a \in (0, 1)$,

. / w ^ / / 2Ao Mfi\\Y{u)-y,^,,t\\ + J^\\PAT V^(a:^)|l r ^ r. 

pp /Xpost w , 5^0 ^ max < ^ , . , -^—^ ... .TT/ / *\attt\ [a + Vl - a^] 

r a2 N 



+ W-T —] )(l + A.(5,p2,rM)) 

^ \og(—^^=]il + A^^i5,p„Yiu)))}, (22) 



where $\lambda_{\min,01}(\omega) = \lambda_{\min, U_{01}}(H_{Y(\omega)}(x^*))$ and $\Delta_*$ and $\Delta_{**}$ are defined by (37).

To have convergence of the posterior distribution to $\delta_{x^*}$ for a fixed $\omega$ when $x^*$ is on
the boundary, a similar argument as in the case when $x^*$ is an interior point applies (with
$A^T V_{y_{exact}}(x^*) A$ and $H_{Y(\omega)}(x^*)$ replaced by $U_{01}^T A^T V_{y_{exact}}(x^*) A U_{01}$ and $U_{01}^T H_{Y(\omega)}(x^*) U_{01}$ respectively),
but also we must have $\tau/\nu = \gamma^2 \to 0$. Hence, in this case to have the convergence
we must assume that $\nu = \tau/\gamma^2 \to 0$ and $\gamma \to 0$ as $\tau \to 0$.

The value of a balances the parts of the Prokhorov distance attributable to Gaussian 
and exponential tails on B{x*, 6) (see the proof of the theorem for details). In the ill-posed 
case, the Gaussian Prokhorov rate is slower than the exponential one, hence we can choose 
a in such a way that a ~ 1 and log (— 7^/Vl — (^^) — )■ 00 as 7 — )■ 0, e.g. we can choose 
1 — = 7^, i.e. a = a/1 — 7^. If UoiA'^Vy{x*')AUQi is of full rank, then the exponential rate 
—7^ log 7 is slower than the Gaussian one t^/— log r, hence we can choose a ^ such that 
the rate — log(a^r) remains slower, e.g. a = t^^^. 

These theorems give an upper bound on the Prokhorov distance between the posterior
distribution and the limit for any particular instance of observed data $Y(\omega)$. To "lift" the
result obtained to a bound on the Ky Fan distance over all $\omega$, we use the following
generalisation of the lifting theorem of Hofinger and Pikkarainen (2007) to the case of different
bounds for different outcomes $\omega$.

Theorem 3. Let random variables $X_1, X_2$ and $Y_1, Y_2$ be defined on the same probability
space $(\Omega, \mathcal{F}, P)$ with values in metric spaces $(X, d_X)$ and $(Y, d_Y)$, respectively, and suppose the
sample space $\Omega$ is partitioned into two parts, $\Omega = \Omega_1 \cup \Omega_2$, $\Omega_1 \cap \Omega_2 = \emptyset$.
Assume that there exist positive nondecreasing functions $\Phi_1$ and $\Phi_2$ such that

$$\forall \omega \in \Omega_k, \quad d_X(X_1(\omega), X_2(\omega)) \le \Phi_k(d_Y(Y_1(\omega), Y_2(\omega))), \quad k = 1, 2,$$

i.e. we have different upper bounds on $\Omega_1$ and $\Omega_2$.
Then, the following inequalities hold:

$$\rho_K(X_1, X_2) \le \max\{\rho_K(Y_1, Y_2) + P(\Omega_2),\ \Phi_1(\rho_K(Y_1, Y_2))\},$$
$$\rho_K(X_1, X_2) \le \max\{\rho_K(Y_1, Y_2),\ \Phi_1(\rho_K(Y_1, Y_2)),\ \Phi_2(\rho_K(Y_1, Y_2))\}.$$
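The two inequalities can be checked by a short direct argument. The following is a sketch, not taken from the paper, and it assumes the usual definition of the Ky Fan metric, $\rho_K(\xi_1, \xi_2) = \inf\{\varepsilon > 0 : P(d(\xi_1, \xi_2) > \varepsilon) < \varepsilon\}$, which gives $P(d_Y(Y_1, Y_2) > \rho) \le \rho$ for $\rho = \rho_K(Y_1, Y_2)$.

```latex
% Sketch under the stated assumption on the definition of \rho_K.
% Notation: \rho = \rho_K(Y_1, Y_2), D_X = d_X(X_1, X_2), D_Y = d_Y(Y_1, Y_2).
\begin{align*}
&\text{On } \Omega_k \cap \{D_Y \le \rho\}: \quad
   D_X \le \Phi_k(D_Y) \le \Phi_k(\rho)
   \quad (\Phi_k \text{ nondecreasing}). \\[4pt]
&\text{Second inequality: let } \varepsilon = \max\{\rho, \Phi_1(\rho), \Phi_2(\rho)\}.
   \text{ Then } \{D_X > \varepsilon\} \subset \{D_Y > \rho\}, \text{ so} \\
&\qquad P(D_X > \varepsilon) \le P(D_Y > \rho) \le \rho \le \varepsilon. \\[4pt]
&\text{First inequality: let } \varepsilon = \max\{\rho + P(\Omega_2), \Phi_1(\rho)\}.
   \text{ Then } \{D_X > \varepsilon\} \subset \Omega_2 \cup (\Omega_1 \cap \{D_Y > \rho\}), \text{ so} \\
&\qquad P(D_X > \varepsilon) \le P(\Omega_2) + \rho \le \varepsilon. \\[4pt]
&\text{In both cases } P(D_X > \varepsilon') < \varepsilon' \text{ for every } \varepsilon' > \varepsilon,
   \text{ hence } \rho_K(X_1, X_2) \le \varepsilon.
\end{align*}
```

The first form is the useful one when a bound $\Phi_1$ is only available on a high-probability event $\Omega_1$, since the complement enters only through its probability $P(\Omega_2)$.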

In our case, $(X, d_X)$ is the space of all distributions equipped with the Prokhorov metric,
and $(Y, d_Y)$ is the metric space $\mathcal{Y}$ with the $\ell_2$ metric. Theorems 1 and 2 provide an upper
bound $\Phi_1$ on the event $\Omega_1$ where a random matrix $H_{Y(\omega)}(x^*)$ (or $U_{01}^T H_{Y(\omega)}(x^*) U_{01}$) is of
full rank, and the first statement of the theorem is applied to obtain the Ky Fan rate of
convergence.

5.2 Consistency of the posterior distribution

Applying the lifting inequalities given in Theorem 3 together with Theorem 1 we obtain the
following bound on the Ky Fan distance.
Denote



Vrain = mill 'K/exact « (s^^) , (23) 

= (24) 

(UoiATAU^,) ' ' 

\\Pat^9{x*)\\ 
^min-^minjpos (f/oiA^Af/o^i)' 

and, for small enough $\rho_K(Y, y_{exact})$,

$$C_k = c_k \left[ 1 - \frac{M_{f2}\, \rho_K(Y, y_{exact})}{\lambda_{\min\,pos}} \right]^{-1}, \quad k = 1, 2. \qquad (25)$$


Theorem 4. Suppose we assume the Bayesian model defined in Section 4.1, and that the
assumptions on $f_y$, $g$ and $\delta$ stated in Section 4.4 hold.

Assume that $x^*$ is an interior point of $\mathcal{X}$ and that $[A^T V_{y_{exact}}(x^*) A : B(x^*)]$ is of full rank.

Assume that as r 0, \rain{,Hu) > PkC^^, 1/exact), where = A'^Vy^^^^,{x*)A + uB{x*) 
and that 



A* < max { u, pk{Y, i/exact), \I-T log Tj-^ ) } , (26) 



where Aq is defined by /[3y]) . 

Then, 3tq > such that for\/T G (0,ro], and small enough v and t/v, 

PK(/ipost,5x-) ^ max{2pK(^,2/cxact),ClPK(^,yexact) + C2l^ (27) 



+ 



log a 



^ X -\ 1/2 



1 + A,,A'(5,0)) 



where Ci and C2 defined by ^2^) with Uqi = I, A^_^(5, P2) is defined by [4^ . 

Under the assumptions on $\tau$, $\nu$ and $\delta$ given in Section 4.4, $\Delta_{*,K}(\delta, 0) = o(1)$ as $\tau \to 0$.

Assumption Amin(ifj,) ^ Pk(^, 1/exact) is necessary so that Amin(-ffy(a;)(a;*)) can be bounded 
from below by a positive value with high probability. In the well-posed case, this holds with 
high probability; in the ill-posed case, this holds \i v ^ Pk(^, 2/exact)- In the Gaussian or 
Poisson cases, where Pk(^5 ?/exact) = a/^, this means that we assume that -\/r/7^ ~^ 



Theorem 5. Suppose we assume the Bayesian model defined in Section 4.1 and that the
assumptions on $f_y$ and $g$ stated in Section 4.4 hold.

Denote \mm,oi = Amin,Uoi (^i^); '^^crc = A^Vy^^^^^{x*)A + uB{x*). 

Assume that 

• UoM^Vy^....ix*)A : B{x*)]U^, ts of full rank, 
prior dispersion 7^ satisfies ^ — )■ 0,

for any a G (0, 1) that may depend on r and 7, 



r , I T \ T 



A* < max<( z/, ^/-- log , log 

Amin.Ol \Amin,01/ ^ \Z/Vl — 



where Aq zs defined by /[3y]) . 
Then, for small enough t, 



PK (/ipost , S^* ) ^ max <^ 2pK {Y, ?/exact 



post, "a*; iiio,^ ^ ^/^Kv-" 5 ycxact;, jrpj ^pf^j TTnTrT^ 



Amin,pos(t/01^^K/exact(a;*)^f^( 



+ \/-T slog T ) +logCp )(1 + A.,,^(5,P2)) 

^min,01 V V^min,01, 



7^ , / 1^P2 



log , (1 + A..,^(5,p2)) , (28) 

which holds for any a G (0, 1) where A^,^k{S,P2) o-nd A^^,^k{S,P2) o'^e defined by ( fTTP , Ci and 
C2 ore defined by (d^. 

Under the assumptions on $\tau$, $\nu$ and $\delta$ given in Section 4.4, $\Delta_{*,K}(\delta, p_2) = o(1)$ and
$\Delta_{**,K}(\delta, p_2) = o(1)$ as $\tau \to 0$.

Hence, in the case that the solution is on the boundary, the competing rates of convergence
are the Ky Fan distance for the data $\rho_K(Y, y_{exact})$ and the rate of convergence of the posterior
distribution.

Recall that in the ill-posed case (if UQiA'^Vy^^^^^{x*)AUoi is not of full rank), Amin.oi ^ 
V ■ const, and in the well-posed case Amin,oi ^ const. 

5.3 Convergence of the data in Ky Fan metric 
5.3.1 Examples 

Now we consider some examples. 

Corollary 2. Let the assumptions of Theorem^ on the prior distribution hold, and Yt be 
independent rescaled Poisson random variables, with $\nu = \tau/\gamma^2$, $\tau = \sigma^2$.



1. If X* is an interior point of X , then, for small enough a and 7, 

PK(/^post, ^x*) ^ 



CiV-rlogr + Ca;^ 



+ C^,cT^^''"^'^T\J- log (r(i-")/27") 



;i+o(i)), 
where $a = 0$ if $A^T V_{y_{exact}}(x^*) A$ is of full rank and $a = 1$ otherwise, and the constants
are given by



n oil iiV2 

Gi = 2||yexact||i max 



A I It/exact I loo \ 

\ •^min,pos 



= ''™;°,J |PaV^(x-)||, 

^min,posl,^ 

C3,a = - a)\^in{A^A) + aXm\n,i-PA{B{x* 



1/2 



If a = 0, the fastest rate is a\/ — log o , with 7 = a^l'^\— logcr] . 

If a = 1 and r = cr^, the fastest rate is o'^l^-J— log a, with 7 = cr^/^[— logcr]"^/^. 

2. If AJ- A is not of full rank and x* is on the boundary of X , we have an additional term 
of order —7^ log (037^) . 

5.3.2 General inequalities

First we give upper bounds on the Ky Fan distance in terms of the moments of $\|Y - \mu\|$
and then consider particular cases with independent observations.

Theorem 6. 1. If 3a > 0: Ee^H^-^'ll < 00, then 

2. If $\exists \kappa > 0$: $E[\|Y - \mu\|^{\kappa}] < \infty$, then

$$\rho_K(Y, \mu) \le [E\|Y - \mu\|^{\kappa}]^{1/(\kappa + 1)}.$$

The proof is obvious, applying the Markov inequality to the corresponding function of
$\|Y - \mu\|$.
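The moment bound can be compared with a Monte Carlo estimate of the Ky Fan distance. The sketch below is illustrative only: it assumes the definition $\rho_K(Y, \mu) = \inf\{\varepsilon > 0 : P(\|Y - \mu\| > \varepsilon) < \varepsilon\}$, uses Gaussian noise as a hypothetical error model, and the helper names are not from the paper. The bound $[E\|Y - \mu\|^{\kappa}]^{1/(\kappa+1)}$ follows from Markov's inequality, $P(\|Y - \mu\| > \varepsilon) \le E\|Y - \mu\|^{\kappa}/\varepsilon^{\kappa}$, by equating the right-hand side to $\varepsilon$.

```python
import numpy as np

def ky_fan_to_point(dist, tol=1e-9):
    # Empirical inf{eps > 0 : P_n(D > eps) < eps} by bisection.
    # The set of admissible eps is upward closed, so bisection is valid;
    # eps slightly above 1 is always admissible because probabilities are at most 1.
    lo, hi = 0.0, 1.0 + tol
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.mean(dist > mid) < mid:
            hi = mid
        else:
            lo = mid
    return hi

def markov_bound(moment, kappa):
    # Upper bound [E ||Y - mu||^kappa]^(1/(kappa+1)) on the Ky Fan distance.
    return moment ** (1.0 / (kappa + 1.0))

rng = np.random.default_rng(0)
n, sigma = 4, 0.05  # hypothetical dimension and noise level
dist = np.linalg.norm(sigma * rng.standard_normal((100_000, n)), axis=1)

rho_hat = ky_fan_to_point(dist)
bound = markov_bound(n * sigma ** 2, kappa=2)  # E||Y - mu||^2 = n sigma^2 for this model
print(f"empirical Ky Fan distance: {rho_hat:.4f}, Markov bound: {bound:.4f}")
```

The Markov bound is typically loose for light-tailed noise, which is why the exponential-moment route of the first part of the theorem gives sharper rates in the Poisson and Gaussian cases.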

Now we evaluate the Ky Fan distance in the following two particular cases: the rescaled
Poisson distribution corresponding to the tomography case (Section 2.1), and $Y_t = \mu_t + \sigma Z_t$
where the distribution of $Z_t$ is independent of $\sigma$.

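Example 4. A vector of rescaled independent Poisson random variables: $Y_t/\tau \sim \text{Pois}(\mu_t/\tau)$.
Apply the Chernoff-Cramér bound to obtain that for all $t$ and all $x, \varepsilon > 0$,

$$P(\|Y - \mu\| > \varepsilon) \le e^{-x\varepsilon} E e^{x\|Y - \mu\|} \le e^{-x\varepsilon} E e^{x\|Y - \mu\|_1} = e^{-x\varepsilon} \prod_t E e^{x|Y_t - \mu_t|}.$$

Now, $E e^{x|Y_t - \mu_t|} \le E e^{x(Y_t - \mu_t)} + E e^{-x(Y_t - \mu_t)}$. The cumulant function of a Poisson random
variable $Z$ with parameter $\lambda$ is $\log E e^{xZ} = \lambda[e^x - 1]$; hence, for $Y_t = \tau Z$ and $\lambda = \mu_t/\tau$, the
cumulant function of $Y_t - \mu_t$ is

$$c_t(x) = \log E e^{x(Y_t - \mu_t)} = \log E e^{x\tau Z} - x\mu_t = \frac{\mu_t}{\tau}[e^{x\tau} - 1 - x\tau].$$

Hence, the cumulants of the rescaled Poisson distribution are $\kappa_{t,k} = \mu_t \sigma^{2(k-1)}$ (with $\tau = \sigma^2$). Similarly,

$$\log E e^{-x(Y_t - \mu_t)} = \frac{\mu_t}{\tau}[e^{-x\tau} - 1 + x\tau], \quad \forall x > 0.$$

Hence, denoting $M = 2\sum_t \mu_t$, we have

$$P(\|Y - \mu\| > \varepsilon) \le e^{-\varepsilon x + 2\sum_t c_t(x)} = \exp\{-\varepsilon x + M[e^{x\tau} - 1 - x\tau]/\tau\}.$$

Since $x > 0$ is arbitrary, we can take $x$ corresponding to the minimum of the upper bound,
which is achieved at $x = \log(1 + \varepsilon/M)/\tau$, implying

$$P(\|Y - \mu\| > \varepsilon) \le \exp\left\{-\frac{\varepsilon + M}{\tau} \log\left(1 + \frac{\varepsilon}{M}\right) + \frac{\varepsilon}{\tau}\right\} \le \exp\left\{-\frac{\varepsilon^2}{2M\tau}\left(1 - \frac{\varepsilon}{3M}\right)\right\},$$

due to the inequality $(1 + x)\log(1 + x) - x \ge \frac{x^2}{2}\left(1 - \frac{x}{3}\right)$ for small enough $x > 0$. For
$\varepsilon \le 3M/2$ we have

$$P(\|Y - \mu\| > \varepsilon) \le \exp\{-\varepsilon^2/(4M\tau)\}.$$

Using the corresponding lemma, for $\tau \le 1/(2eM)$, the solution of $\exp\{-\varepsilon^2/(4M\tau)\} = \varepsilon$ satisfies

$$\varepsilon = \sqrt{-2\tau M \log(2\tau M)}\,(1 + \omega),$$

where $\omega = o(1)$ as $\sigma \to 0$ and $\omega \le 0$.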
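The closing asymptotic statement can be checked numerically. The sketch below solves $\exp\{-\varepsilon^2/(4M\tau)\} = \varepsilon$ by bisection, using the equivalent form $\varepsilon^2 + 4M\tau \log \varepsilon = 0$, and compares the root with $\sqrt{-2\tau M \log(2\tau M)}$. The function names and the values of $M$ and $\tau$ are illustrative assumptions.

```python
import math

def solve_eps(M, tau, tol=1e-14):
    # Root of eps^2 + 4*M*tau*log(eps) = 0 on (0, 1); the left side is increasing in eps,
    # tends to -infinity as eps -> 0 and equals 1 at eps = 1.
    a = 4.0 * M * tau
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid * mid + a * math.log(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def asymptotic_eps(M, tau):
    # Leading-order expression sqrt(-2 tau M log(2 tau M)), valid for 2*tau*M < 1.
    return math.sqrt(-2.0 * tau * M * math.log(2.0 * tau * M))

M = 1.0
for tau in (1e-2, 1e-4, 1e-6):
    exact, approx = solve_eps(M, tau), asymptotic_eps(M, tau)
    print(f"tau={tau:.0e}  root={exact:.5e}  asymptotic={approx:.5e}  ratio={exact / approx:.4f}")
```

The ratio stays below 1, in line with $\omega \le 0$, and approaches 1 only slowly as $\tau$ decreases, because the correction is of order $\log\log(1/\tau)/\log(1/\tau)$.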

Theorem 7. Assume that $Y_t$ are independent, $E Y_t = \mu_t$ and $\text{Var}(Y_t) = w_t \tau$.

1. Assume that $\exists C_t \ge 1$ such that $\kappa_{t,k}$, the $k$th cumulant of $Y_t$, is bounded by $|\kappa_{t,k}| \le
C_t w_t \tau^{k-1}$ $\forall k > 2$ and $C_t$ and $w_t$ are independent of $\tau$. Denote $M = 2\sum_t C_t w_t$.

Then, for $\tau \le 1/(2eM)$,

$$\rho_K(Y, \mu) \le 2\sqrt{-\tau M \log(2\tau M)/2}.$$

2. Assume that $\exists K \ge 2$: $E|Y_t|^K < \infty$ and $\forall d > 0$ $E|Y_t|^{K+d} = \infty$. Assume that $E|Y_t -
\mu_t|^K \le \tau^{m(K)} L_K$ for some $L_K > 0$ that may depend on $\mu_t$ or $w_t$ but not on $\tau$, for some
$m(K) > 0$.

Then, for small enough $\tau$,
Proof. 1. Following the rescaled Poisson example, we have that the cumulant function for
$Y_t$ is bounded by

$$c_t(x) = \log E e^{xY_t} = x\mu_t + \frac{x^2}{2} w_t \tau + \sum_{k \ge 3} \frac{x^k}{k!} \kappa_{t,k} \le x\mu_t + \frac{x^2}{2} w_t \tau + \sum_{k \ge 3} \frac{x^k}{k!} C_t w_t \tau^{k-1}$$

$$= x\mu_t + \frac{x^2}{2} w_t \tau + \frac{C_t w_t}{\tau}\left[e^{x\tau} - 1 - x\tau - (x\tau)^2/2\right]$$

$$\le x\mu_t + \frac{C_t w_t}{\tau}\left[e^{x\tau} - 1 - x\tau\right]$$

since $C_t \ge 1$. Similarly, $\log E e^{-xY_t}$ can be bounded in the same way. Hence, we have

$$P(\|Y - \mu\| > \varepsilon) \le \exp\left\{-\varepsilon x + \frac{M}{\tau}[e^{x\tau} - 1 - x\tau]\right\},$$

where $M = 2\sum_t C_t w_t$. Now, this is the same upper bound as for the rescaled Poisson
distribution. Hence, we have the same inequality for the Ky Fan distance.
2. Apply the Markov inequality to the random variable ||y — //| 



\K. 



WM.. M ^ m\Y-iim riT^^'^y^LK 

Hence, an upper bound on the Ky Fan distance satisfies utLk/z^ — z, i.e. z — [nr'^^^^/^Lx]''^^^^'''^^. 

□ 

The conditions in the first case are satisfied, for example, for the binomial distribution
$Y_t \sim \text{Bin}(n_t, p_t)$, independently, since $c_t(x) = n_t \log(p_t e^x + q_t) \le n_t p_t (e^x - 1)$.
Here is an example for the second case.

Example 5. Suppose we have Y-t following a t distribution with u degrees of freedom, means 
/It and scales y/rwt. Then we can take K — v — 2 — 5 for some S > 0; 

where pk is the Kth moment of the standard t^ distribution, i.e. m{K) — K/2 and Lk — 
wfi/K. Hence, 

Note that this bound holds if $Y_t$ can be written as $Y_t = \mu_t + \sigma w_t Z_t$ where $Z_t$ are iid and
whose distribution is independent of $\tau$.



6 Approximation of the posterior distribution 

For completeness, we shall also show how the posterior distribution can be rescaled so that 
it converges to a finite limit. This can be used to approximate the posterior distribution in 
practice, for small values of r. 
For a differentiable identifiable likelihood and prior distribution positive and continuous
at the "true" value of the parameter, the posterior distribution is asymptotically Gaussian 
in the case where the "true" parameter is an interior point of the parameter space. This 
result is known as the Bernstein-von Mises theorem. Van der Vaart (1998) gives a total 
variation distance version of the theorem under mild additional assumptions on the error 
model, adapted from Le Cam (1953) and Le Cam and Yang (1990). The theorem in fact 
implies that, under the above conditions, the prior distribution has no influence on the 
asymptotic distribution. 

We extend the Bernstein-von Mises theorem in two directions. Firstly, the assumption 
of identifiability of the likelihood is relaxed; a consequence is that the limit of the posterior 
distribution, as well as the rate of convergence, depend on the choice of the prior distribution. 
Secondly, the assumption that the "true" value of the parameter is an interior point of the 
parameter space is relaxed, by assuming that it can lie on the boundary. In the latter case, 
we show that the limiting distribution changes from Gaussian to a product of Gaussian and 
exponential in different directions. 

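Now we make use of the three-part transformation of the re-centered variable $x - x^*$ from
Section 4.3: $w = U^{-1} v = U^{-1}(x - x^*)$. Define the following scaling transform $S = S_{\tau,\gamma}$:
$\mathcal{X} - x^* \to \mathbb{R}^{p_0 + p_1} \times \mathbb{R}_+^{p_2}$: $S = (S_0, S_1, S_2)$, with

$$S_0 = (U^{-1})_0 (x - x^*)/\sqrt{\tau},$$

$$S_1 = (U^{-1})_1 (x - x^*)/\gamma, \qquad (29)$$

$$S_2 = (U^{-1})_2 (x - x^*)/\gamma^2.$$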

Denote the posterior distribution oi 6 = S{x — x*) given Y{(jj) by /ipost(i^)- 
The limiting distribution is defined in terms of the following parameters: 

^^00 = f/oVV._.(a:*)f/o^, 
B = V^g{x'), 

and Bij = UfBUj, i,j = 0, 1, 2. 

Now we can formulate a version of the Bernstein - von Mises theorem for this problem. 



Theorem 8. Consider the Bayesian model defined in Section and let the assumptions 



on fy and g stated in Section hold. Assume that condition ( fi^) holds. 

Assume that matrices $\Omega_{00}$ and $B_{11}$ are of full rank, that $B_{00} - B_{01} B_{11}^{-1} B_{10} > 0$, and that



^ ^ 0, 72 = o(ri/3) 
T 



r r Var(Y) 
c = iim — — < 00, iim < 00. 

r— >0 7^ r— >0 T 
Assume that the following limit exists for all $\omega$: $\lim_{\tau \to 0} [U_0^T \nabla f_{Y(\omega)}(x^*)/\sqrt{\tau}] < \infty$, and denote



.T — 



Denote by fi* the following measure on 



PPO+Pl 



(30) 



H*(^uj)=J\fp, (aoM,fioo') X (0,5n ) x Exp^^ (6). 
Then, as r — )■ 0, 

r5(x-x.*)|y-/i1lTy''^-T^0 as r ^ 0. 
An upper bound on the total variation distance is given in the proof of the theorem. 

Remark 1. We assumed that $\nabla_\eta f_{y_{exact}}(\eta)|_{G(\eta) = y_{exact}} = 0$, i.e. that the likelihood is regular
with respect to the "natural" parameter. If this assumption does not hold, we have an
additional linear subspace determined by $A$ and $\nabla_\eta f_{y_{exact}}(G^{-1}(y_{exact}))$ where the limit is
exponential.



A Appendix: proofs

A.1 Proof of Proposition 1 in Section 3

Proof. For arbitrary $\nu > 0$, suppose $x \in \mathbb{R}^p$ is such that $(A^T A + \nu B)x = 0_p$, the zero vector
in $\mathbb{R}^p$. We have to show that $x = 0_p$. But $(A^T A + \nu B)x = 0_p$ implies $x^T (A^T A + \nu B)x = 0$,
and so by non-negative-definiteness of $B$ and $A^T A$, $x^T B x = 0 = x^T A^T A x$. But then $Bx =
0_p = A^T A x$, and so $x^T [B : A^T A] = 0_{2p}^T$. By the assumed full rank of this matrix, $x$ must be
$0_p$.

Now fix $\nu_0 > 0$. By Theorem 2 of ?, page 313, (with his $A$ replaced by $A^T A + \nu_0 B$), there
exists a nonsingular real matrix $P$, not necessarily orthogonal, such that $P^T (A^T A + \nu_0 B) P =
I$ and $P^T B P$ is the diagonal matrix $\Lambda$ of the solutions for $\lambda$ to $|B - \lambda(A^T A + \nu_0 B)| = 0$
(which all satisfy $0 \le \lambda \le \nu_0^{-1}$). But then $P^T A^T A P = I - \nu_0 \Lambda$ and for any $\nu > 0$,
$P^T (A^T A + \nu B) P = I + (\nu - \nu_0)\Lambda$, both of which are of course also diagonal.

The matrix $P$ can depend on the choice of $\nu_0$, but evidently always diagonalises $A^T A$,
$B$ and any linear combination. Also, $\Lambda$ depends on $\nu_0$, but since $|B - \lambda(A^T A + \nu_0 B)| =
(1 - \lambda\nu_0)^p |B - \alpha A^T A|$ where $\alpha = \lambda/(1 - \lambda\nu_0)$, the (diagonal) elements in $\Lambda$ are $\lambda_i =
\alpha_i/(1 + \alpha_i \nu_0)$ where $\{\alpha_i\}$ are the solutions to $|B - \alpha A^T A| = 0$ (possibly some $\alpha_i = +\infty$).
So $P^T B P = \operatorname{diag}(\alpha_i/(1 + \alpha_i \nu_0))$, $P^T A^T A P = \operatorname{diag}(1/(1 + \alpha_i \nu_0))$ and for any $\nu$, $P^T (A^T A +
\nu B) P = \operatorname{diag}((1 + \alpha_i \nu)/(1 + \alpha_i \nu_0))$. For the final assertions, note that $\nu (A^T A + \nu B)^{-1} =
P \operatorname{diag}(\nu(1 + \alpha_i \nu_0)/(1 + \alpha_i \nu)) P^T$, which converges to $P \operatorname{diag}(\delta_i) P^T = C$, say, where $\delta_i = \nu_0$ if
$\alpha_i = +\infty$, and $0$ otherwise. Further, we can estimate the difference: $\nu (A^T A + \nu B)^{-1} - C =
P \operatorname{diag}(\nu(1 + \alpha_i \nu_0)/(1 + \alpha_i \nu) - \delta_i) P^T = \nu P \operatorname{diag}(\phi_i) P^T + O(\nu^2)$, where $\phi_i = 0$ if $\alpha_i = +\infty$ and
otherwise $\phi_i = 1 + \alpha_i \nu_0$. □
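The simultaneous diagonalisation and the limit of $\nu(A^T A + \nu B)^{-1}$ can be verified numerically. The sketch below is an illustration under stated assumptions: the matrices $A$ and $B$ are small hypothetical examples (rank-deficient $A$, positive definite $B$), and $P$ is built from a Cholesky factor followed by a symmetric eigendecomposition, which is one standard construction and not necessarily the one in the cited theorem.

```python
import numpy as np

def simultaneous_diagonaliser(AtA, B, nu0):
    # Returns P and lam with P^T (AtA + nu0 B) P = I and P^T B P = diag(lam).
    M = AtA + nu0 * B
    L = np.linalg.cholesky(M)            # M = L L^T
    Linv = np.linalg.inv(L)
    lam, Q = np.linalg.eigh(Linv @ B @ Linv.T)
    return Linv.T @ Q, lam

def limit_matrix(P, lam, nu0):
    # C = P diag(delta_i) P^T with delta_i = nu0 where alpha_i = +infinity,
    # i.e. where lam_i attains its upper bound 1/nu0, and 0 otherwise.
    delta = np.where(np.isclose(lam, 1.0 / nu0), nu0, 0.0)
    return P @ np.diag(delta) @ P.T

# Hypothetical example: A has rank 2 in dimension 3, B is positive definite.
A = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
B = np.array([[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]])
AtA = A.T @ A
nu0 = 1.0

P, lam = simultaneous_diagonaliser(AtA, B, nu0)
C = limit_matrix(P, lam, nu0)

nu = 1e-7
approx = nu * np.linalg.inv(AtA + nu * B)
print("lambda_i:", np.round(lam, 4))
print("max |nu (A^T A + nu B)^{-1} - C| =", np.abs(approx - C).max())
```

In this example the null space of $A^T A$ is spanned by the third coordinate vector, so the limit $C$ has a single non-zero entry $1/B_{33} = 1/2$ in position $(3, 3)$, and the printed discrepancy is of order $\nu$.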
A.2 Proofs of the results in Section 5



Proposition 2. Let assumptions on fy, g in Section and assumptions / flgj) on 5 hold. 

Assume that UQi[A^Vy^^^^Xx*)A : i?(x*)]f/(^ is of full rank, and that r/i^ — j- and — )■ 
as r — !■ 0. 



Then, for any e G {cipv^iY, ^cxact) +£2^^, 5) snc/i i/iai ~ fievjr — )• 00, /or ant/ /3 G [0, 1] 

fx\B{x*,e) 



JX\B{x* "-^ 



1-r 



2r 



+ p2exp <; \ (1 + Aij 



r 



An 



1 + A2 

1 + Ao ^ 1 + Ao' 



and, in particular. 



^(p+P2)/2 [27r]P«i/2 det{U) _ / x^HHoifiiXo - ^hyix" 



where Ai and A2 are defined by 



exp 



2r 



[I + A3 



1 _ rr^2 n _ f,-{b,-SXmMPiB{x*)))VT^eu/T\ 



P2e" 



^2(5, P2,y) 



1 + &r'5[A„,ax(f/25(a;^)) + 25^Cy;/3] 



X 



r 



1 - K'6[X^,,iU2B{x*)) + 25Vpp^C,3/3] J 

(-f^-f^oi,oi)[<^ - ll-^-f^oi,oi-^-^oi,oia;o||]^ I poi 
2(1 -7)t ' 2 



n -1 



(31) 
(32) 

det(Moi,oi + (5VpD) V/^ 
Ldet(Moi,oi-5v^/^)l ^ 



P2 

exp |5v^4i7i/oi,oi^i^o"i!oi^Mo"i!oi^^oi,oia:o/r} H - e'^^'^/^^^ 

j=i 



'7P2) 



-1 



where $D = \operatorname{diag}(p_0 (C_{f3} + 2\nu C_{g3}) I_{p_0},\ 2\nu p_1 C_{g3} I_{p_1})/3$ and $\Delta_3$ is defined in the proof of the
proposition.

An upper bound on $\Delta_2$ is given by:



P2 



Ll - b;^J[Xm..{U2B{x*)) + 26,/pp^Cgs/3] 



{HHoi^oi)[S- \\HHq^q^HHoi,oiXo\\] poi 



1 -1 



X exp 



2(1 -7)r 
t[1 - 6^pXl,^{DHH^')] 



1 — exp 



2 



1 + S^K^^iDHH, 



01,01) 



1 - 5^K,UDHH^l, 



01^ 



-P2 



(34) 

POl/2 

(35) 



(36) 



Here Amm(-f^-f^oi,oi-D 



mm 



C/3+2!^Ca3 



2G 
Proof of Proposition 2. Taylor decomposition of $h_y(x)$ at $x^*$ for $x \in B(x^*, \delta)$ gives



hy{x) 



where 

lAooWl 



^ ijkhyijI^cj^Xi Xj^{Xj Xj)(^Xj^ X^ 



< 



< 



ijk 
X — X* 1 1 1 



6 



Cf3\\{U-'),{x-x^)\\l + uC,sJ2\\{U-'), 



[x-x 1 



k=0 



[po(C/3 + 2uCgs)\\ {U-')^ (x - x'')\\l + Aup,Cgs\\ {U-')^ (x - x 



*\ll2 
2 



+AuC,s\\{U-')^{x-x^)\\l] 
for some $x_c \in (x, x^*)$.

Denote $b_0 = HH_{01,01} x_0 = \nabla h_y(x^*)$. Make a change of variables $v_0 = (U^{-1})_{01} (x - x^*)/\sqrt{\tau}$
and $v_2 = \nu (U^{-1})_2 (x - x^*)/\tau > 0$; the Jacobian is $J = \tau^{(p_{01} + 2p_2)/2} \nu^{-p_2} \det(U)$ and we have



B{x*,S) 



e-hy^^)l-dx > Jexp|-i /ij,(x^) - ||MoY,oia;o||V2 



rl/2 



X 



/ exp {-\\\HHlil,{v, - r-V^Xo)|P - b^vA 



'r|bolP+r2||^,2||2A 

X exp {-(5v^V(fdiag (po(C/3 + 2uCg3)Ip^,AupiCg3lp,) vq/Q} 

T 



X exp { -V^vlUlB{x^)U2V2 - —v^U^ B{x*)U2V2 - rd^—^Wv^Wl \ dv^dv,. 



Note that $U_i^T H_y(x^*) U_2 = \nu U_i^T B U_2$ for any $i$.
For $v_0, v_2$ such that $\tau \|v_0\|^2 + \tau^2 \|v_2\|^2/\nu^2 < \delta^2$,



V^v^UjB{x*)UoVo + —v^UjBix*)U2V2 < \mUU2B{x'mv2\\ 
< X^UU2B{x''))\\v2\\iS. 
Denoting 

-f^-f^oi,oi 



T 



.1/2 



Moi.oi + Sy^D, 



-2C„ 



b = b + 6\^UU2B{x^)) + 6^^2^. 



we can bound the integral above by 



B{x*,&) 



e-'^y^'^^'^dx > Jexp |-i \hy{x'') - WHR^lilb^W'^ /2 



X 



\\V0\?+T^\\v2\\^/U^<i^ 



exp <! -^||MoY,oi(^o - T-^'^'HH,lMf -Vv2 \ dv2dvo 
Hence,

e-^WA(^a; > jexp 



B(x*,5) 



X 



X 



> 



||f0|P+r2||l,2|P/l/2<52 



2hy{x*)-\\HH-,]Sbo\\' 
exp 1-6^^21 

Jexp \ jh,{x*)-\\HH^llib.\? \ Jj6-[2.r/^[det(#i^oi,oi)]-^/^ 



X f] [l - e-^'^'^/(^^ 

i=l 



'ap2) 



A min(-H"-ff01,0l)[^ - ||-g-goi|oi&o||]^ Poi 

2(1 -q;)t ' 2 



= Jexp 6ri[2^]m/2[det(Moi,oi)]-^/^[l + A3] 

for some o; e [0, 1], since 5v/t — > 00, Ainin(-f^-f^oi,oi)<^^/''' ~^ <^ ^ lko|| with high Py^^^^ 
probabihty. Here A3 is defined by 



P2 

A3((^,P2,y) = -l + n 



X 



X 



P2 _ 

exp [-5^hlHH^,\,DHH^,]M{2T)] - e"^^ 

i=l 



A min(-H"-H"oi,Ol)[^ - ||-H"-H"oi|oi&o||]^ poi 

2(1 -q;)t ' 



[det{I + 6^HH,l,,D)\ 



-1/2 



Similarly, 

< Jexp {-/i,(x*)/r + ||MoYjiXo||7(2r)} 



X / exp 

/£2<T||,;o||2+T2||t,2||Vl'2<52 

I uuni-i/ uC^ 

X exp 



^ VTVq Uq^B{x )U2V2 - —V2 U2 B{X )U2V2 

X exp jr^v^^^ 11^2! Ii I dvarfvo 

'hy{x*)-\\HH^^iX\\V^ 



< J exp — 



X 



/£2<r||i;o||2+r2||,;2||2/i/2<52 



exp |-i||moY,oi(^^o - T-V2Moi;oi&o)ir - ^^^^2} rf^2rf^;o, 
where

i^i/oi.oi = HHoi,oi-S^D, 



2Cg3 



Assume that S is small enough so that bi > for all i and i?i7oi,oi is positive definite. 
The ring {e^ < rHvoH^ + r^||t'2|P/z^^ < S"^} is a subset of 

{ I l^ol r < Pe'/r & 1 1^2| r > is' - r| [t^ol ^^7^'} U {Pe'/r < \ \vo\\'} 
for any /3 e (0, 1), therefore 

exp \-Fv2 - \\HHlil^{vQ - r-^/2^i/o"i'oi^o)ir/2| dv2dv^ 

1/2 

1)2|P>£2(1-/3)z/2/t2 



< [27rr/2[det(Moi,oi)]-'/' / exp{-Fv2}dv2 

J||i;2|P>£2(l-/3)i/2/-r2 

+ [ expl-l\\HHill,{vo-T-'/mH^,]M\Advo. 



|^;o||2>/3£2/r 

The integral over the exponential density can be bounded by 



/ exp {-6^'i;2} d'i;2 < / exp {-6^'i;2} d'i;2 

^||j;2|P>£2(l-/3)i^Vr2 J\\v2\\oo>eVT^W'r 
i i=l i 



since v'l — Psu/r ^ oo as r ^ 0, where 
Ai(£,(5,p2) = -1 + 



1 1 i-mi(i-e-'^'-^-/-) 

1 _ ("1 _ g-5minyT^£I^/T'|P2 

< -IH ^ ^ 



P2 / \ 

Hence, Ai is small if \J\ — fiev/T — >■ oo and 5^1 — fiev/T — >■ as r — >■ 0. 
Combining these results together, we have



Ib(x 



(x* ,5)\B{x* ,e) 



-hy{x)/T^^ 



B{x*,S) 



-hy{x)/T(^^ 



< 



P2e 



+ 1-r 



2t ' " 



n 



r 



1 + K'6[X^,,{U2B{x^)) + 2S^Cg;/3] 
1 - 6-'5[A^ax(f/2fi(x^)) + 25^C,s/3] 



2(1 -a)r ' 2 

P2 

exp {5^b^HH^i,,DHH,,]Mr} J] [l 



-1 r ~ 1 1/2 

det(Moi,oi) 

det(Moi,oi) 

1 



-fei(5!//(T^«P2) 



i=l 



Now we take into account the error of approximating the integral over X by the integral 
over B{x*, e): 



X\B{x*,e) 



-hy{x)/r^^ r 

J B(x 



(x* ,S)\B{x* ,e) 



/B(x*,(5) "'•^ ' JA'\_B(x*,<5) 

JB{x*,5)\B(x*,e)^ "-^ 



-hyi^)/-rdx 



+ 



(1 + Ao)/^(^..^^)e-'^«(-)/-dx 1 + Ao' 
Choosing some fixed a, e.g. a = 1/2, we have the required statement. 



□ 



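Proof of Theorems 1 and 2. By Strassen's theorem, for any $x$, $\rho_P(\mu_{post}(\omega), \delta_x) = \rho_K(\xi, x)$
where $\xi \sim \mu_{post}(\omega)$. Hence, we find an upper bound on the Ky Fan distance between $X \mid Y$
and $x^*$.

Denote $\varepsilon_0 = \sqrt{\beta}\, \varepsilon$ and $\varepsilon_1 = \sqrt{1 - \beta}\, \varepsilon$ for some $\beta \in (0, 1)$. Then, $\varepsilon_0 + \varepsilon_1 = \varepsilon(\sqrt{\beta} + \sqrt{1 - \beta})$.
Take $\varepsilon_0 > \|x_0\|$. Using Proposition 2 we have that an upper bound on $\varepsilon$ satisfies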



X 



B{x* ,5)\B{x* ,e) 



-hy{x)/r^^ 



B(x*,5) 



-hy{x)/T^r^ 



P2exp 



+i-r 



-^i^}(l + AO 
{^0 - \ \HHmmbo\\Y 



2r 



(1 + A2) + Ao 



where A = Ao/(l + Ao) and A2 = (1 + A2)/(1 + Aq) — 1. By an assumption of the Theorems, 

Ao < CiP/^(F,ycxact) + C2Z/. 
In particular, for some a G (0, 1), with a = a/i/fl and a = (1 — a) / a/1 — /3, take the
smallest e > such that 



An < e, 



1-r 



(gp - ||-^-^Ol|oifeo||)lAmm(-^-^01,Ol) . Po ^ 

2r ' 2 



Pi exp 



r 



1 + A2 
1 + Ai)(l + A2) < aei. 



The last two inequalities imply that as r/ Amin(-ffi^oi,oi) 0, Eq —> and £:QAmin(-f^-f^oi,oi)/ t 
00. Similarly, as z//r —i- 00, £1 — )■ 0. Hence, using Lemmas [3] and HI we have that 



^0 ^ I l-^-^oi!oi-^-^oi,oi2;o| 



\ 



lo I C 

;i+A2)2A^i„(Moi,oi) V VA„,i„(moi,oi)(l + A2J 



log 



_rp2_ 



By Lemma [71 



koll < 



log (^(1 + Ai)(l + A2) 

Mfi\\y - z/cxactll + v\\PATVg{x* 

-^minpos 



which we can substitute into the upper bound for Eq, and 

||Mol|o,Moi,oi|| = ||(/-5v^Moi!oi^)"'ll = [l - 5 ^K^D H H^^, 



-1 



Now, the smallest $\varepsilon > 0$ such that $\varepsilon > \Delta_0$ and satisfies the obtained upper bound on
$\varepsilon = (\varepsilon_0 + \varepsilon_1)/(\sqrt{\beta} + \sqrt{1 - \beta})$ is the upper bound if it is greater than $\Delta_0$. Otherwise, $\varepsilon < 2\Delta_0$.

The bound on $\varepsilon_0$ with $U_{01} = I$ and $a = 1$ gives the statement of Theorem 1, and adding
up the bounds on $\varepsilon_0$ and $\varepsilon_1$ divided by $\sqrt{\beta} + \sqrt{1 - \beta}$ gives the statement of Theorem 2 with
the errors denoted by



A*((5,p2,l/) 
A**((5,p2,y) 



1 + Ac 



1 + 2 



log(l + A2) - log(l + Ac 



(1 + A2)(a + VT^) 
log((l + AO(l + A2)/(l + Ao)) 

log (rp2/(z^&minVl " O^)) 



log (XnnniHHoifll)/ {o^t)) 



1/2 



1, (37) 



To simplify the expressions, the results are stated with a = (3, making a = y/P and 
a = a/1 — /3, i.e. + = 1. 

We assumed that e < 6; in particular, the upper bound on e is less than 6 if z/^/r = 
— )■ and — 7'^log7/r — ?■ (the latter - in the ill-posed case). 

□

Proof of Theorems 4 and 5. Now we prove Theorems 4 and 5 in the notation defined in the proof of Theorems 1 and 2.

We apply Theorem 3 with $\Omega_1 = \{\omega : \|Y(\omega) - y_{\rm exact}\| \le \rho_K(Y, y_{\rm exact})\}$ and $\Omega_2 = \Omega \setminus \Omega_1$, with $P(\Omega_2) \le \rho_K(Y, y_{\rm exact})$ by the definition of Ky Fan distance, and with the bounds given in Theorems 1 and 2, which we modify to depend on $y$ only via $\|y - y_{\rm exact}\|$. For small enough $\tau$, the assumption of the theorems that $U_{01}^T[A^T V_y(x^*)A : B(x^*)]U_{01}$ is of full rank holds on $\Omega_1$, as we shall show below.

The upper bound depends on $y$ via $\|y - y_{\rm exact}\|$, $\lambda_{\min}(U_{01}^T H_y(x^*) U_{01})$, $\lambda_{\min\,{\rm pos}}(A^T V_y(x^*) A)$, $\Delta_0$ and $\Delta_2$.

We start bounding the eigenvalues from below. Denote HHq^q^ = f/oi-f^i/cxact(^*)f^oi- 
Since [U^{Hy{x*) - Hy_Jx^)Uo)]^j = [VV._.(^^) - "^'fyix^h > -Mf2\\y - 2/exact||, on 
Qi we have, by Lemma [H 



01,0l)> 



where $\mathbb{1}(0) = 1$ if the minimum eigenvalue is achieved on the subspace $U_0$, and is zero otherwise. For small enough $\tau$ and $\delta$, on $\Omega_1$ the lower bound is positive.
Similarly, since A'^Vy{x)A = — V^/y(x), 

|[A^(Vy(x^)-V,_,(x^))A],,| ^ M^2||y-Z/exact|| for all 2, J, 

hence Amin pos(^^Vy(a;*)A) > Aminpos(^^K/exact(a;*)^) - ^/2||^ - l/exact||- Hence, since 
\\H Hfy^Q-j^H HoifiiW < 1, we have that 



'01,01'- 



ll^^oi!oi 


boW < 

Cll 


Mfi\ 


\ Y ^cxact 1 


1 + l^l 


\PAT\/gix*)\\ 


\Y - 


Ammpos(^"^Vy ( 
^/cxactll + C2Z^ 


x*)A) 


1-Mf2\ 


1^ ycxact 1 1 / Amin 


,os{A^Vy_,{x*)Ay 



< 

where Ci and C2 are defined by fl23|) . 

To bound A2 on we need to bound below the following expression for S > | |i/ifoi^oi^o| I 
on Qi: 

Xmm{HHoi^Ol)[S - ||H'i7o"i|oi6o||]^ > [-^min(-f^-f^0l,0l) " Mf2PK(Y, 2/exact)] [S - Ci\\Y - ?/exact 1 1 " C2l^] 

where = [1 - M/2Pk(^, 2/cxact)/Amm pos{A'^Vy^^^^^{x'')A)]-'^ for k = l,2. 
Note also, that on fli, 

, . ;3A^in,p,,,,(A^V;(x*)A + z/S(x^)) 3A^i,,,_p,_,(5(x*)) 

\nin{HHoifiiD ) = mm 



> 3 min 

def 



Po{Cfs + 2iyCg3) ' 2p,Cy3 

Xmin,pos{A''Vy^^^Jx*)A) - Mf2PK{Y,ye..ct) >^min,I-PX^y{B{x*)) 
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A 



DH- 
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Hence, on Qi, 
A2(5,p2,Z/) < -1 + 



X 



1 + b^J[X^UU2B{x*)) + 26^Cg3/S] 

[XmmiHH^ 



P2 



Oi^Ol) ~ Mf2PK{Y, 1/cxact)] [6 - Ci\\Y - ?/cxact 1 1 " C2Z/] poi 



X exp 



1 + 

1 - Sy/p/XDn] 

{ 



POl/2 



1 — exp 



TA/P2/2 

J (5^(C/3 + 2z/Cg3)[clpK('i^, Z/cxact) + C^I^] 



~P2 



3r[l - 52p/A 



(3^ 



This gives an upper bound on A2. 

The upper bound increases in Aq, hence we need to bound it from above. By Assumption 
4a) in Section 

/ exp{-[hy{x)-hy{x*)]/T}dx <e^^^f''>^^y-y^--^'^^^^ [ 

Jx\B{x*,S) Jx' 

Using Lemma [7] and the lower bound on the minimum eigenvalue of HHqi qi, we have, on 



' X\B(x*,5) 



exp{-[Vxact(2;)-/iycxact(a;*)]/r}cix. 



XqHHqi^OIXo > 



HPATVgix*)\\ - MyipK(r,yexact)]^ 
Xrainpos{A^Vy^^^^^{x*)A) - M/ 2PK J/cxact ) 



and, by Lemma El on Qi, 

det(Moi,oi) < det(Mo\,Ol)[l + ^/2PK(V',2/exact)]^°. 

Hence, Aq is bounded on Qi from above by 



AS(5(0,5)) 









- dx 


I - J 


exp < 


[Cl I/-C2PK {y,y exact )] ^ 

I 2r J 


\ 



n,(&.)[det(Mo\,Ol)]'/'[l + M;2PK(l^,l/cxact)]^»/' 



(39) 
(40) 



[27r]w'i/2[l + A*] 

where A3 is a lower bound on A3 on Qi derived in a similar way. 

Using Aminoi = X^inUoiiHy^^^.^ix")) and A22 = M/2Pk(>^, l/exact)/Aminoi, we have that, on 

Ci\\Y - ?/exact|| + C2Z/ 



^0 < 



+ 



£1 < 



1 - Sy/P/XOH 



:i + A^)2A^i„oi(l- A22; 



log a 



'POl 



0?T 



A^inOl(l- A22)(l+ A^)2 



log 



rp2 



+ log((l + Ai)(l + A^)) 



34 



since the function $-x\log x$ increases for $x < 1/e$.
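The monotonicity used here can be checked numerically. The following is a minimal sketch, not part of the original argument; the function name `neg_x_log_x` and the grid are illustrative choices.

```python
import math


def neg_x_log_x(x: float) -> float:
    """Return -x * log(x) for x > 0."""
    return -x * math.log(x)


# Increasing grid of 1000 points in (0, 1/e]; the grid size is arbitrary.
grid = [k / 1000 * math.exp(-1) for k in range(1, 1001)]
values = [neg_x_log_x(x) for x in grid]

# True when the function is strictly increasing along the grid.
is_increasing = all(a < b for a, b in zip(values, values[1:]))
print(is_increasing)
```

The derivative of $-x\log x$ is $-\log x - 1$, which is positive exactly when $x < 1/e$, so the grid check is only a sanity test of that calculation.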

The bound on $\varepsilon_0$ increases in $\|y - y_{\rm exact}\|$, and the bound on $\varepsilon_1$ is independent of it. Assume that $\Delta_0/(1+\Delta_0)$ is less than the upper bound on $(\varepsilon_0 + \varepsilon_1)/(\alpha + \sqrt{1-\alpha^2})$ on $\Omega_1$.

Using the lifting Theorem 3 we have that, for small enough $\tau$, $\nu$,

CiPk(^, 2/exact) + C2Z/ 



PK (yUpost , 4* ) < max <^ 2pK (Y, ?/exact ) 



+ 



^1 



v/(l + A*)2A^in0l(l- A22 

r r. / TP2 



log a 



log 



Denoting 

A^,ii-((5, P2) 



(l + A5)(l-A22)i/2 
log((l + Ai)(l + A5)) 



-^minOl (1-A22)(1 

+ log((l + Ai)(l + A5)) 



A^)^ 



,log(l + A^) + 0.51og(l-A22)] 

log (AminOl 

/(aV)) 



1/2 



log 



we have the statement of Theorem 5.



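Proof of Theorem 3. First we note that

$$\begin{aligned}
& P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_1(\rho_K(Y_1, Y_2)) \cap \Omega_1\} + P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_2(\rho_K(Y_1, Y_2)) \cap \Omega_2\} \\
&\quad \ge P\{\Phi_1(d_Y(Y_1(\omega), Y_2(\omega))) \le \Phi_1(\rho_K(Y_1, Y_2)) \cap \Omega_1\} + P\{\Phi_2(d_Y(Y_1(\omega), Y_2(\omega))) \le \Phi_2(\rho_K(Y_1, Y_2)) \cap \Omega_2\} \\
&\quad = P\{d_Y(Y_1(\omega), Y_2(\omega)) \le \rho_K(Y_1, Y_2) \cap \Omega_1\} + P\{d_Y(Y_1(\omega), Y_2(\omega)) \le \rho_K(Y_1, Y_2) \cap \Omega_2\} \\
&\quad = P\{d_Y(Y_1(\omega), Y_2(\omega)) \le \rho_K(Y_1, Y_2)\} \\
&\quad \ge 1 - \rho_K(Y_1, Y_2).
\end{aligned}$$

On the other hand,

$$\begin{aligned}
& P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_1(\rho_K(Y_1, Y_2)) \cap \Omega_1\} + P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_2(\rho_K(Y_1, Y_2)) \cap \Omega_2\} \\
&\quad \le P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_1(\rho_K(Y_1, Y_2))\} + P\{\Omega_2\}.
\end{aligned}$$

Putting these together implies

$$P\{d_X(X_1(\omega), X_2(\omega)) > \Phi_1(\rho_K(Y_1, Y_2))\} \le \rho_K(Y_1, Y_2) + P\{\Omega_2\}, \qquad (41)$$

hence, using Lemma 5 we have

$$\rho_K(X_1, X_2) \le \max\{\Phi_1(\rho_K(Y_1, Y_2)),\ \rho_K(Y_1, Y_2) + P(\Omega_2)\},$$

and we have the first statement. The second statement follows from the first inequality and

$$\begin{aligned}
& P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_1(\rho_K(Y_1, Y_2)) \cap \Omega_1\} + P\{d_X(X_1(\omega), X_2(\omega)) \le \Phi_2(\rho_K(Y_1, Y_2)) \cap \Omega_2\} \\
&\quad \le P\{d_X(X_1(\omega), X_2(\omega)) \le \max[\Phi_1(\rho_K(Y_1, Y_2)), \Phi_2(\rho_K(Y_1, Y_2))]\}.
\end{aligned}$$

□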

A.3 Ky Fan distance inequalities

In this section we quote the result by Hofinger and Pikkarainen (2007).

Lemma 3. (Lemma 1, Hofinger and Pikkarainen (2007)) Let $\xi \sim \mathcal{N}_p(\mu, \Sigma)$. Define



^2./{p + ir ^fp^sodd, ^^^^ 
2P/p^ if p is even. 

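$$\kappa_p = \max\{1, p - 2\}. \qquad (43)$$

Then there exists a positive constant $\delta(p)$ such that for any $\|\Sigma\| < \delta(p)$,

$$\rho_P(\mathcal{N}(\mu, \Sigma), \delta_\mu) \le \left(-\|\Sigma\| \log\{C_p \|\Sigma\|^{\kappa_p}\}\right)^{1/2}. \qquad (44)$$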

In particular, we will use the following bound on the solution z = z{p, A) of 

.(p,«) = inf{.: l-r(ii||.l)<.}. 

given in the proof of this lemma for sufficiently small a: 

z{p,a) < [-a\og{Cpa^-)f\ (45) 

Here $\Gamma(x \mid a, b)$ is the cumulative distribution function of the Gamma distribution $\Gamma(a, b)$ with probability density function $f(x) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}$, $x > 0$.

Lemma 4. Assume that $A \to 0$ and $A \le e^{-1}$. Then the solution of

$$\exp\{-z/A\} = z$$

satisfies

$$z = -A\log(A)(1 + \omega),$$

where $\omega \le 0$ and $\omega = o(1)$ as $A \to 0$.

Proof. Taking the logarithm of the given expression, we have

$$-z/A = \log z.$$

Since $A \to 0$, we must have $z/\log z \to 0$, which implies $z \to 0$. Denote $f = z/A$, i.e. $z = Af$. Hence, the equation above can be rewritten as

$$-f = \log A + \log f,$$

implying that $f \to \infty$ as $A \to 0$ at the rate $f = -\log A\,(1 + o(1))$. Hence, the solution is $z = -A\log A\,(1 + o(1))$.

To show that $z \le z_0 = -A\log(A)$, we note that for $A \le e^{-1}$,

$$\exp\{z_0/A\}\,z_0 = \exp\{-\log(A)\}(-A\log(A)) = -\log(A) \ge 1 = \exp\{z/A\}\,z,$$

implying the desired inequality, since $t \mapsto t\exp\{t/A\}$ is increasing in $t > 0$.

□
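The statement of Lemma 4 can be illustrated numerically. The sketch below is an illustration only: it solves $\exp\{-z/A\} = z$ by bisection and compares the root with $z_0 = -A\log A$. The helper names `solve_fixed_point` and `relative_gap`, the tolerance, and the chosen values of $A$ are assumptions made for this example.

```python
import math


def solve_fixed_point(A: float, tol: float = 1e-14) -> float:
    """Solve exp(-z/A) = z on (0, 1) by bisection.

    g(z) = exp(-z/A) - z is strictly decreasing with g(0) = 1 > 0 and
    g(1) < 0, so the root is unique.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if math.exp(-mid / A) - mid > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def relative_gap(A: float) -> float:
    """Return omega such that z = -A*log(A)*(1 + omega)."""
    z = solve_fixed_point(A)
    z0 = -A * math.log(A)
    return z / z0 - 1.0


for A in (1e-1, 1e-2, 1e-4, 1e-8):
    # omega is expected to be non-positive; it tends to zero only slowly.
    print(A, solve_fixed_point(A), relative_gap(A))
```

In this sketch $\omega$ stays negative for every tested $A \le e^{-1}$. Its magnitude shrinks only slowly as $A \to 0$, which agrees with $\omega = o(1)$ but is not a proof of it.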

The following lemma follows directly from the definition of Ky Fan distance.

Lemma 5. If $P(d(X, Y) > \varepsilon_1) \le \varepsilon_2$ for some $\varepsilon_1, \varepsilon_2 \in (0, 1)$, then $\rho_K(X, Y) \le \max(\varepsilon_1, \varepsilon_2)$.
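To make Lemma 5 concrete, the following sketch computes the Ky Fan distance under an empirical measure. It assumes the common definition $\rho_K(X, Y) = \inf\{\varepsilon > 0 : P(d(X, Y) > \varepsilon) \le \varepsilon\}$, which this section does not restate (some texts use a strict inequality instead). The function name `ky_fan_distance` and the sample values are hypothetical.

```python
def ky_fan_distance(distances):
    """Empirical Ky Fan distance from a sample of values d(X_i, Y_i).

    Assumes rho_K = inf{eps > 0 : P(d > eps) <= eps}, with P the empirical
    measure. At most k samples exceed eps exactly when eps is at least the
    (n - k)-th smallest distance, so the infimum is the minimum over k of
    max(that order statistic, k / n).
    """
    d = sorted(distances)
    n = len(d)
    best = 1.0  # k = n is always feasible and gives eps = 1
    for k in range(n):
        best = min(best, max(d[n - k - 1], k / n))
    return best


# Hypothetical sample: nine small distances and one large outlier.
sample = [0.05] * 9 + [5.0]
eps1 = 0.05
eps2 = sum(1 for v in sample if v > eps1) / len(sample)  # P(d > eps1)
rho = ky_fan_distance(sample)
print(rho, max(eps1, eps2))
```

Here $P(d > \varepsilon_1) = \varepsilon_2$ holds by construction, and the computed distance does not exceed $\max(\varepsilon_1, \varepsilon_2)$, as Lemma 5 states.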

A.4 Auxiliary results

Lemma 6. Assume that $\alpha, \beta > 0$ satisfy $\alpha\beta < 1/4$.
If $z < 1/(2\beta)$ and $z \le \alpha + \beta z^2$, then $z \le 2\alpha$.

If $\alpha \to 0$ and $\beta$ is bounded, the solution of $z = \alpha + \beta z^2$ such that $z < 1/(2\beta)$ satisfies $z = \alpha(1 + o(1))$.

The proof is obvious.
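A short numerical check of Lemma 6 is sketched below, for illustration only. It uses the closed form of the smaller root of $\beta z^2 - z + \alpha = 0$, written as $2\alpha/(1 + \sqrt{1 - 4\alpha\beta})$ to avoid cancellation. The name `smaller_root` and the test values are assumptions for this example.

```python
import math


def smaller_root(alpha: float, beta: float) -> float:
    """Smaller root of beta*z**2 - z + alpha = 0, assuming alpha*beta < 1/4.

    Algebraically equal to (1 - sqrt(1 - 4*alpha*beta)) / (2*beta); the
    rationalised form below is numerically stable for tiny alpha*beta.
    """
    disc = 1.0 - 4.0 * alpha * beta
    return 2.0 * alpha / (1.0 + math.sqrt(disc))


for alpha, beta in [(0.1, 2.0), (1e-3, 5.0), (0.2, 1.0)]:
    z = smaller_root(alpha, beta)
    # The root lies below 1/(2*beta) and below 2*alpha.
    print(alpha, beta, z, z <= 2.0 * alpha, z < 1.0 / (2.0 * beta))

# As alpha -> 0 with beta fixed, the ratio z / alpha approaches 1.
print(smaller_root(1e-8, 3.0) / 1e-8)
```

Any $z < 1/(2\beta)$ with $z \le \alpha + \beta z^2$ lies at or below this smaller root, and the rationalised form shows directly that the root is at most $2\alpha$.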

Define the following projections 

Py = V^V, 

p^ y = {A^VAyA^VA^ A^PyA. 
Lemma 7. // [A'^Vy{x)A : B{x)] is of full rank, 

\\H-\x)\\ = [mm{X^i^^p^^^{A^Vy{x)A + uB{x)),uXrnin,i-PAA^i''M~'^ 

where $\lambda_{\min,P}(B(x)) = \min_{\|v\|=1,\ Pv=v} \|B(x)v\|$ is the smallest eigenvalue of $B(x)$ on the range of $P$.
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In particular, if Hy{x) is of full rank and V^hy{x) is uniformly bounded on B{x*,S) for 
y G yioc, 

^ mill [Xmin,pos{A'^Vy{x)A) + iyXmin,PA,viB{x)), pXmin,I-PA,viB{x))] ' 

+ 1 ^-TrT^ ["''WC- Pa.v) V/,(x*)|| + ||(/ - P^y) Vg(x' 

Proof of Lemma 7.

The norm of $H^{-1}$ is given by

\\H-^\\ = a'^[X„,in{A^VA + uB)]-'^ =a^[mm \\{A^VA + uB)x\\]-^ 

\\x\\=l 



a^[mm \\{A^VA + uB)Patx + uB){I - PA'r)x\\]-^ 



X =1 



= cr2[min( min \\{A^V A + uB)Patx\\ min p\\B{I - Pat)x\\)]-^ 

= [min(A^i„, {A^VA + , z/A^i„, 7_p^^ (5))] 

Weyl inequality implies that A^m, p^^ (^^^^4 + z/i?) ^ A™„,p^y (A'^T^A) + uXmin,p^j.iB)- 
Note that since we assumed that Vy^^^^^{x*) is of full rank, the projection on the range of 

A^ coincides with the projection on the range of A'^Vy^^^^^{x*)A. 

Now we find an upper bound on \\PIy^^^^ti^*)~^'^^yi^*)\\ using the first statement in 

Lemma [D 

\\Hy^..JxT''^hyix^)\\ = r-i||iJ,_,(a;^)-i(V/,(x^) + z/V^7(x^ 



^ r-'\\Hy_,ixT'\\PAr\\PATi'^fy{xn + l^'^9ixn)\\ 

+ T-'\\Hy_JxT%-PATm - PAT)[Vfy{x*) + uVg{x'')]\\ 

^ XrmnMA''Vy_,{x*)A) + Z.A_,p^, (S (x^) ) " ^^^^ ^"^^^ ^ "^^^''^ 

+ T ^---r\\iI-PAT)[Vfyixn + uVg{xn]\\ 



[\\PATVfy{x^)\\ + l^\\PAT Vgjx^m 
Xmin,pos 

iA^Vy^^^Jx*)A) + pX^i^,p^^{B{x*)) 



-^minJ-P^TK-^y^ )) 

□ 

Lemma 8. 1. $\|(C + \delta I)^{-1}x\| \le (\delta + \lambda_k(C))^{-1}\|P_C x\| + \delta^{-1}\|(I - P_C)x\|$,

where $k = \mathrm{rank}(C)$ and $\lambda_k(C)$ is the smallest positive eigenvalue of $C$, and $P_C = C^{+}C$ is the projection matrix.

2. Cauchy's interlacing theorem (1): let $C = C^T$ be an $n \times n$ matrix, $L$ any $(n-k)$-dimensional linear subspace, and $C_L = P_L C P_L$. Then, for any $j = 1, \dots, n-k$,

minD,>o AAminpos(^^^) where D is a diagonal matrix with non- 
negative entries. 

Proof of Lemma 8.

3. Xj{A^DA) = Xj{D^/^AA^D^/^), and since j ^ rank(A^DA) = Tank{D^/^AA^D^/^), 

XAD^^^AA'^D^/^) ^ min DiXAPnAA^ Pd) ^ min AA,+™(AA^) 

D^>0 Di>0 

by Cauchy's interlacing theorem, where m = rank(PD), n = dim(D). 

If j = r = rank(P^TPD), Xr{A'^DA) is the smallest positive eigenvalue of A^DA, and 
j + m = rank(PD) +rank(P4TPD) ^ rank(P4T). Hence Xr+mi^^^) ^ \£i'nk{p 
the latter is the smallest positive eigenvalue of A'^A. 

□ 



A.5 Proof of Bernstein-von Mises theorem in a nonregular case

Proof. Denote $\sigma = \sqrt{\tau}$. An upper bound on the total variation distance between the rescaled posterior distribution and its limit is given by

$$\|\mathbb{P}_{(S(x-x^*)\mid Y)} - \mu^*\|_{TV} \le \|\mathbb{P}_{(S(x-x^*)\mid Y)}\mathbb{1}_{B_R} - \mu^*\mathbb{1}_{B_R}\|_{TV} + \|\mu^* - \mu^*\mathbb{1}_{B_R}\|_{TV} + \|\mathbb{P}_{(S(x-x^*)\mid Y)}\mathbb{1}_{B_R} - \mathbb{P}_{(S(x-x^*)\mid Y)}\|_{TV},$$

where the balls $B_R$ are defined below. Here $\mu\mathbb{1}_{B_R}$ is a probability measure $\mu$ truncated to $B_R$ and normalised to be a probability measure. We start with the distance between the truncations of the rescaled posterior distribution and the limit on $B_R$.

Denote $w_k = U_k^T(x - x^*)$, and rescale to $v_0 = w_0/\sigma$, $v_1 = w_1/\gamma$ and $v_2 = w_2/\gamma^2$, with the Jacobian of this change of variables being $J = \sigma^{p_0}\gamma^{p_1 + 2p_2}\det(U)$.

Consider a neighbourhood of $x^*$, $B_\delta(x^*) = x^* + B_\delta$, where $B_\delta = B_2(0, \delta_0) \times B_2(0, \delta_1) \times B_\infty(0, \delta_2)$. For the rescaled parameter, we use the corresponding neighbourhood $B_R = B_2(0, R_0) \times B_2(0, R_1) \times B_\infty(0, R_2)$ where

$$R_0 = \delta_0/\sigma, \quad R_1 = \delta_1/\gamma, \quad R_2 = \delta_2/\gamma^2.$$

We assume that $\delta_k$ are such that $\delta_k \to 0$ and $R_k \to \infty$. In addition, we will need conditions
6f <^ 5o, 61 ^ 60 and 60 ^ u = In particular, we can take 60 = v^l"^ , b\ = ^7 and ^2 = 7. 
Approximate $h_y(x)$ by a quadratic function using Taylor decomposition in a neighbourhood of $x^*$:

$$h_y(x) = h_y(x^*) + [\nabla h_y(x^*)]^T(x - x^*) + \tfrac{1}{2}(x - x^*)^T H (x - x^*) + \Delta_{00}(x).$$

Note that

$$\nabla h_y(x^*) = P_{A^T}(\nabla f_y(x^*) + \nu\nabla g(x^*)) + \nu(I - P_{A^T})\nabla g(x^*).$$

Bound $\Delta_{00}$ on $w \in B_\delta$ using Taylor decomposition of $h_y(x)$: $\exists\, x_c \in (x, x^*)$:



|Aoo(5)| 



^ ^ ^ijkhy (Xc) {Xi ) [Xj Xj) [Xk X/, 



ijk 



- [i^ - x*fW\ix,){x - X*)] 



^fe 1 



k=0 



On Bs, 



Denote 



Then, 



< -||x-x*||i lltyol 

D 

< ^\\x-x*\\i [\\wo\\l{Cf3 + 3iyCg3) + 3iyCg3[\\wi\\l + \\w2\\l]] . 



def 

\x-x*\\i = \ \wo\\i + ||wi||i + \ \w2\\i < VPo^o + +P2S2 = S. 



Do = Po(C/3/3 + iyCg3)Ip^, Di = PiCgsIp^ . 



[hy{x) - hy{x*)]/a'^ < h^V2 + ^(wo - xo/(7)^ifoo(^'o - xq/g) + ^^v^BiiVx 

+ x/i;<5oit;i - I |iJo'o',To| 17(2^2) + 5[v^DoVo + t;r^it;i]/2 

+ [avlU^ + -ivlUl]BU2V2 + '^^f^l^\\v2\\l 

+ —V2 B22V2, 



where $H_{ij} = U_i^T H U_j$, $B_{ij} = U_i^T B U_j$, $i, j \in \{0, 1, 2\}$, and $H_{jk} = \nu U_j^T B U_k$ for $(j, k) \ne (0, 0)$.
Therefore, we have that 

[hy{x) - hy{x*)]/a^ < hlv2 + \\Hll,\vo - H^o'HooXo/a)\\y2 + \\Bli\vi + V^B^,'BioVo)\\l/2 

- \\H-,'/'HooXo\\V{2a'), 
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where Bn = Bu + SiDi, Hqo = Hqo - uBqiB-^^Bw + SoDq, and 



6 + 1 



(^o||-Bo2||2,oo + (^l||-Bl2||2,oo + | I-B22 1 1 l,oo + SP2S2 J'^ 



since on Bs, 



1 



[aVQ UqB + B] U2V2 + yfs B2V2 < [5o| I-B02I |2,oo + 5l| I-B12I |2,oo] | |t'2| |l + y||-B22||l,oo||t'2||l- 

Therefore, 



> 



exp{-[hy{x)-hy{x*)]/a^dx > J exp{\\H-'/''HooXo\\V{2a'')} 
I exp {-lFv2 - \\Hll,\v^ - ^oV^oo^o/<t)||V2 - \\B\i\v^ + ^B^^B,^vo)\\ll2\ dv 
Jexp|||#oo'^=^^ooa;oir/(2(x')} [det(^oo) det(5n)] 



x*+Bs 
X 



P2 



X 



P2 



(27r)(P' 



'0+Pl)/2 



xr 



2(72 



^^ixr 



272 ' 2 



= Jexp{| |i/oo'^\| I V(2a2)}M^(i/, 5, 6, u)C{H, B, b), 



since we have that Si ^ 5o- 
Denote 



P2 



MR{Q,A,b,u) = J][l-exp{-6,i?2}]r 



i=l 

X r 

P2 



{Qoo)[Ro - WQaoHooXoW/a]"^ po 



2 ' 2 



C(Q,A,6) = J]6ri [det(goo)det(Ai)]- 



1/2 



i=l 



Here $M_R(Q, A, b, \nu)$ is the measure of $B_R$ under the considered class of distributions, in particular, $\mu^*(B_R) = M_R(\Omega, B, a, 0)$, and $C(Q, A, b)$ is the inverse of the normalising constant.

Similarly, we obtain an upper bound on e"'*"*^'^^/'^^ for any x € Bs{x*). Denote Bu — 
Bu — 61D1, Hqo = Hqo — vBoiBi^Bio — SoDq, and 



6-1 



5o||-Bo2||2,oo + 5l||-Sl2||2,oo + y 1 1 -^22 1 1 l,oo + 5p252—^ 



for 5fe small enough so that Hqq and Bu are positive definite and 62 > 0. 
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Then, for any x e Bs{x*), 

exp {-[hy{x) - hy{x*)]/a^} dx < Jexp |||i?o-'/'i/oo^oir/(2^')} 

X exp [-Fv2 - \\Hil,\vo - H^,'HooXo/a)\\y2 - + V^B-^'Bom)\\V'^} dv. 
The posterior density normalised by the posterior measure of $B_R$ can be written as

p{S{x - x*) I Y)dx {-|l^oo^(^'o - HooHooXo/a)\\'^/2 - (fi + v^i?oo^5oiWo)||V2} dv 

p{Br\Y) - MR{H,B,b,iy)C{H,B,b) 

X exp{-Fv2}il + A2,oiR)), 



where 



A2,o(i?)) = 



exp{||^oo''''^ooXo||V(2^')} 

exp{||^oV^'//oo^o||V(2<7')} 

= exp{x^HooHoo[SoDo + vd^BoiB^^ D^B^^ B^q]H^^^ H^^xo/ a""} - 1 

< exp{[5oPo(C/3/3 + uCg^) + u5w^Cg^\\BoiB^^B^lB^o\\] H^^oo^oVll ll^oV^oo 



Van - 1. 



The total variation distance between the rescaled posterior distribution and its limit, both truncated to $B_R$, is bounded by



< 2 



\\^{S{x-x*)\Y)'^Bn - IJ^'^BrWtV < 2 / 

JB 



Br 



ti*{B 



R) 



Br 1^*{Br) 

M{n,B,a,0)C{n,B,a 



p{v I Y)^%Br) 



MR{H,B,b, u)CiH,B,b) 



[nv)piBR I Y) 

(l + A2,o(i?)) 



- 1 



dv 



X 



exp [-blv2 - \\Hoi\vo - Hoo'HooXo/a)\\y2 - \\Bli\v, + ^500^501^^0)1172} ^ 



exp [-b^V2 - \\nli'{vo - aoW/2 - vlBnV,/2} 



dv 



Br 



fi^iBj, 



MR(Q,B,a, 0)C{n,B,a) 



exp {-[xQHooHQQHooXo/a'^ - ao^looao]/2} 



MR{H,B,b, u)C{H,B,b) 
X exp I^Jwa + VqDoVo/2 + 6i{vi + y/uBQQBoiVo)'^Di{vi + 0^5oo^5oiVo)/2| 
X exp {vo[HooXo/a - QooOo] } " l] + dv, 



where 



Do = SoDo-Uo[V'fy{x*)-V''fy^^^Jx*)]U^-u[Boo-Bo,B^^'B^o], 
h = -t/2^[V/,(x*)-V/,_,(a;*) + ^V</(a;*)] 

+ 1 



(^o||-So2||2,oo + 5l||-Sl2||2,oo + 1 1-^22 1 1 l,oo + ^^2^2 g 
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Since = ^ ^ and ^ ^ > C A/Var(Y), _Do is positive definite and ^2 > with high 
probabihty as cr — ?■ 0. 

Now we show that the ratio of the constants is greater than 1. Since the function $a^{-1}(1 - e^{-ax})$ decreases in $a$ for large $ax$, we have that



P2 

n 



^-1 



1 - exp{-biR2} 

1 - exp{-[6, + 62,i]R2} [bi + S2,^]-^ 



> 1. 
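The monotonicity behind this product bound can be checked numerically. The sketch below is illustrative only: the function name `scaled_exp_gap` and the values standing in for $b_i$, the perturbation and $R_2$ are hypothetical, not taken from the paper.

```python
import math


def scaled_exp_gap(a: float, x: float) -> float:
    """Return (1 - exp(-a*x)) / a, using expm1 for accuracy when a*x is small."""
    return -math.expm1(-a * x) / a


x = 2.0
rates = [0.5 + 0.25 * k for k in range(80)]
values = [scaled_exp_gap(a, x) for a in rates]

# True when the function is strictly decreasing in a along the grid.
is_decreasing = all(u > v for u, v in zip(values, values[1:]))

# One factor of the product: rate b against a perturbed rate b + shift.
b, shift, radius = 3.0, 0.4, 10.0  # hypothetical stand-ins for b_i, the perturbation, R_2
factor = scaled_exp_gap(b, radius) / scaled_exp_gap(b + shift, radius)
print(is_decreasing, factor)
```

The derivative in $a$ equals $[e^{-ax}(1 + ax) - 1]/a^2$, which is negative for every $ax > 0$, so each factor of the product is at least 1 whenever the perturbation is non-negative.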



Also, we have Bu — Bu = 6iDi > 0, hence, by Interlacing Theorem (Lemma [8]), all 
eigenvalues of Bu are greater than the corresponding eigenvalues of Bu, and therefore 
det(i?ii) > det(i?ii), and the difference of arguments of the Gamma functions is given 
by 

{Bii)Rl Amm(-Bii)(-Ri — v^-Roll-Bii -Bioll)^ 
2 ~ 2 

= -6X{Dr)Rl/2 + V^R^R,\^^{B,,)c - zyi?2^IIlHi(|llM, 

which is positive if 5Ri <^ ^pOR^, i.e. if bb\ ^ (5o (recall that R\ ^ y/uRo). 

Since for large ax, a~^^'^T{ax \ (3) decreases in a, and, with high probability Aniin(^oo) < 
(iJoo - i'BqiB^IBiq + 5qDq), we have 



[det(l]oo)]-'/'r 



Amin(^00)[fi0-||a0||]^ I PO 



> 1 



[det(ifoo)]^"'"^^r ^ ''''^'"^^"o^[-^o~ll-^oo^-'^°p^°ll/°"^^ I PO 
with high probability. Then, 

MR{n,B,a,Q)C{n,B,a) ^ ^ 
MR{H,B,b,iy)C{H,B,b) ~ 

with high probability. Hence, 

(l + A2,o(i?)) 



F'(5(x-x*)|y)lsfl — TV < — 2 + 2- 

MniH,B,b,u)CiH,B,b) 

exp |-F||i;2||i - ||^oY,oi(^'oi + i/oi!oi^oi/f7)||V2} dv 

Br 

^ (l + A2,o(i?2) 

M{H,a, 6)C{H,a) 
2A2(i?), 



M{H,a,5)C{H,a) - 2 



where 



. M(^,B,a,^)C(ff,^,a) ^ _ ^_ 

M{H,B,a,5)C{H,B,a) 
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Note that M{H, a, 5) /M{H, B, a,5)<l with probabihty — )■ 1 as a — )■ 0, and 



C{H,B,a) 



P2 

n 

i=l 
po 

n 

i=l 



1 + 



26o 



K{Hoo — uBqiB^^ Bio + D 



00 J 



n 1/2 PI 



< 



1 + 



K{HoQ — uBqiB^^Bio — Dqq 

1 + 



n 

i=l 



26, 



X,iBn-SiDi] 
npi/2 



.1/2 



(Bu) - 6iDi 



(-f^OO — '^-Boi-Bii -Bio) — Amax(-D, 



00 J 



where $D_{00} = \delta_0 D_0 + \nu\delta_1 B_{01}B_{11}^{-1}D_1 B_{11}^{-1}B_{10}$ and $\lambda_i(M)$ is the $i$th largest eigenvalue of matrix $M$. Note that matrix $H_{00} - \nu B_{01}B_{11}^{-1}B_{10}$ is positive definite with high probability.
Therefore, with high probability,



A2(i?) < A*(/2)t/exp 
1 + 



6oXq HooHf^Q DqHqq HoqXo 



26o 



bmin — ^2 



P2 



(46) 



X 



2Amax(-Doo)) 



Po/2 



1 + 



25iDi, 



11 



(Bu) - 5iDi 

,11 



Pi/ 2 



Amin (-f^oo — J^BqiBh Bio) — Amax(-Doo))J 
The total variation distance between the limit measure and its truncation to $B_R$ is bounded by



(41) 



R) 



< 2 



1 - nil - eM-W] + 1 - r (^=(|iiM|| 



+ 1-r 



Amin (fioo)(-Ro - ||ao||)^^'o 



The total variation distance between the posterior distribution and its truncation to $B_R$ is bounded by

\\^{S{x-x*)\Y)iBii -'P(S(x-x*)\Y)\\tV < 2F(^s{x-x*)\Y){B'ii) 

Jx\B,{x*) exp{-(/iy(x) - hy{x''))/a'^}dx 
f.^ exp{ — {hy{x) — hy{x*)) / a"^} dx 
= 2Ao{Bs) <2A*{Bs), 

where Aq{Bs) is defined by f l5Ul) . 
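The factor 2 in this bound can be seen in a small discrete example. The sketch below assumes the convention $\|\mu - \nu\|_{TV} = \sum_i |\mu_i - \nu_i|$, which is consistent with the factor 2 but is not restated in this section. The helper names and the probability vector are hypothetical.

```python
def truncate(p, keep):
    """Restrict the probability vector p to the index set keep and renormalise."""
    mass = sum(p[i] for i in keep)
    return [p[i] / mass if i in keep else 0.0 for i in range(len(p))]


def l1_distance(p, q):
    """Sum of absolute differences between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical distribution on five points; the "ball" keeps the first three.
p = [0.4, 0.3, 0.2, 0.06, 0.04]
ball = {0, 1, 2}

p_trunc = truncate(p, ball)
outside_mass = sum(p[i] for i in range(len(p)) if i not in ball)
distance = l1_distance(p, p_trunc)
print(distance, 2.0 * outside_mass)
```

Under this convention the distance between a measure and its normalised truncation equals twice the mass outside the truncation set, so the posterior mass of the complement of $B_R$ is the only quantity that needs to be controlled.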

Combining the three bounds, we have that

$$\|\mathbb{P}_{(S(x-x^*)\mid Y)} - \mu^*\|_{TV} \le 2\Delta_2^*(R) + 2\mu^*(B_R^c) + 2\Delta_0^*(B_\delta),$$

which gives the statement of the theorem.

□ 
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